From mdoerr at openjdk.org Tue Oct 4 09:37:57 2022 From: mdoerr at openjdk.org (Martin Doerr) Date: Tue, 4 Oct 2022 09:37:57 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 We have tested it for a couple of days and there were no new failures. LGTM. ------------- Marked as reviewed by mdoerr (Reviewer). PR: https://git.openjdk.org/jdk/pull/10264 From shade at openjdk.org Tue Oct 4 10:50:14 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 4 Oct 2022 10:50:14 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 So we can now enter this code when holding `AdapterHandlerLibrary_lock`, which has a rank of `safepoint-1`. These locks should probably match the rank of `Heap_lock`, which is `safepoint-2` now. Please update `_alloc_failure_waiters_lock` rank as well. ------------- Changes requested by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/10264 From nick.gasson at arm.com Tue Oct 4 12:56:12 2022 From: nick.gasson at arm.com (Nick Gasson) Date: Tue, 04 Oct 2022 13:56:12 +0100 Subject: Improving the scalability of the evac OOM protocol Message-ID: Hi, I've been running SPECjbb with Shenandoah on some large multi-socket Arm systems and I noticed the concurrent evacuation OOM protocol is a bit of a bottleneck. The problem here is that we have a single variable, _threads_in_evac, shared between all threads. To enter the protocol we do a CAS to increment the counter and to leave we do an atomic decrement. 
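In rough pseudo-code, the current protocol looks something like the sketch below (simplified and only illustrative, not the actual ShenandoahEvacOOMHandler source; the names and the exact OOM handling are approximations):

    // One counter shared by all GC and Java threads; the top bit is an
    // OOM marker, the remaining bits count threads inside evacuation.
    static const jint OOM_MARKER_MASK = (jint)0x80000000;
    static volatile jint _threads_in_evac = 0;

    void enter_evacuation() {
      while (true) {
        jint cur = Atomic::load(&_threads_in_evac);
        if ((cur & OOM_MARKER_MASK) != 0) {
          wait_for_no_evac_threads();   // OOM raised: do not enter
          return;
        }
        // Every entering thread does a CAS on the same cache line.
        if (Atomic::cmpxchg(&_threads_in_evac, cur, cur + 1) == cur) {
          return;
        }
      }
    }

    void leave_evacuation() {
      Atomic::dec(&_threads_in_evac);   // again hits the same cache line
    }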
For the GC threads this isn't really an issue as they only enter/leave once per cycle, but Java threads have to enter/leave every time they help evacuate an object on the load barrier slow path. This means _threads_in_evac is very heavily contended and we effectively serialise Java thread execution through access to this variable: I counted several million CAS failures per second in ShenandoahEvacOOMHandler::register_thread() on one Arm N1 system while running SPECjbb. This is especially problematic on multi-socket systems where the communication overhead of the cache coherency protocol can be high.

I tried fixing this in a fairly simple way by replicating the counter N times on separate cache lines (N=64, somewhat arbitrarily). See the draft patch below:

https://github.com/nick-arm/jdk/commit/ca78e77f0c6

Each thread hashes to a particular counter based on its Thread*. To signal an OOM we CAS in OOM_MARKER_MASK on every counter and then in wait_for_no_evac_threads() we wait for every counter to go to zero (and also to see OOM_MARKER_MASK set in that counter). I think this is safe and race-free based on the fact that, once OOM_MARKER_MASK is set, the counter can only ever decrease. So once we've seen a particular counter go to zero we know that the value will never change except when clear() is called at a safepoint. This means we can just iterate over all the counters, and if we see that they are all zero, then we know no more threads are inside or can enter the evacuation path.

On a 160-core dual-socket Arm N1 system this improves SPECjbb max-jOPS by ~8% and critical-jOPS by ~98% (!), averaged over 10 runs. On a 32-core dual-socket Xeon system I get +0.4% max-jOPS and +43% critical-jOPS. There's also some benefit on single-socket systems: with AWS c7g.16xlarge I see +0.3% max-jOPS and +3% critical-jOPS.

I've also tested SPECjbb on a fastdebug build with -XX:+ShenandoahOOMDuringEvacALot and didn't see any errors.

I experimented with taking this to its logical conclusion and giving each thread its own counter in ShenandoahThreadLocalData, but it's difficult to avoid races with thread creation and this simple approach seems to give most of the benefit anyway.

Any thoughts on this?

--
Thanks,
Nick

From shade at redhat.com Tue Oct 4 14:42:11 2022
From: shade at redhat.com (Aleksey Shipilev)
Date: Tue, 4 Oct 2022 16:42:11 +0200
Subject: Improving the scalability of the evac OOM protocol
In-Reply-To:
References:
Message-ID: <6b3eface-94b9-744d-bc7d-ec5bb4b05c90@redhat.com>

On 10/4/22 14:56, Nick Gasson wrote:
> I tried fixing this in a fairly simple way by replicating the counter N
> times on separate cache lines (N=64, somewhat arbitrarily). See the
> draft patch below:
>
> https://github.com/nick-arm/jdk/commit/ca78e77f0c6
>
> Each thread hashes to a particular counter based on its Thread*.

Yes, striped counter works here fine.

> I've also tested SPECjbb on a fastdebug build with
> -XX:+ShenandoahOOMDuringEvacALot and didn't see any errors.

make test TEST=hotspot_gc_shenandoah exercises evac paths a lot, consider running it on affected platforms.

> I experimented with taking this to its logical conclusion and giving
> each thread its own counter in ShenandoahThreadLocalData, but it's
> difficult to avoid races with thread creation and this simple approach
> seems to give most of the benefit anyway.
>
> Any thoughts on this?

Looks very good, please PR this. There are minor improvements we can do to this patch.
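(For reference, the striping Nick describes -- one padded counter per stripe, selected by hashing the Thread* -- boils down to something like the sketch below. The stripe count, padding and helper names are illustrative assumptions, not the actual draft patch.)

    // N independent counters, each on its own cache line, so uncontended
    // enter/leave traffic is spread over N lines instead of one.
    static const int N_STRIPES = 64;
    struct PaddedCounter {
      volatile jint value;
      char pad[DEFAULT_CACHE_LINE_SIZE - sizeof(jint)];
    };
    static PaddedCounter _threads_in_evac[N_STRIPES];

    static volatile jint* counter_for(Thread* t) {
      uintptr_t h = (uintptr_t)t >> 4;            // hash the Thread*
      return &_threads_in_evac[h % N_STRIPES].value;
    }

    void signal_oom() {
      // Set the OOM marker on every stripe; from then on each counter can
      // only decrease, so observing "zero plus marker" is a stable state.
      for (int i = 0; i < N_STRIPES; i++) {
        jint cur;
        do {
          cur = Atomic::load(&_threads_in_evac[i].value);
        } while (Atomic::cmpxchg(&_threads_in_evac[i].value,
                                 cur, cur | OOM_MARKER_MASK) != cur);
      }
      // Wait until every stripe has drained to zero.
      for (int i = 0; i < N_STRIPES; i++) {
        while (Atomic::load(&_threads_in_evac[i].value) != OOM_MARKER_MASK) {
          SpinPause();
        }
      }
    }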
-- Thanks, -Aleksey From ngasson at openjdk.org Wed Oct 5 11:20:53 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 5 Oct 2022 11:20:53 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac Message-ID: The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html Also tested `hotspot_gc_shenandoah` on x86 and AArch64. ------------- Commit messages: - 8294775: Shenandoah: reduce contention on _threads_in_evac Changes: https://git.openjdk.org/jdk/pull/10573/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294775 Stats: 87 lines in 4 files changed: 62 ins; 6 del; 19 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Wed Oct 5 11:20:53 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 5 Oct 2022 11:20:53 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.inline.hpp line 35: > 33: > 34: void ShenandoahEvacOOMHandler::enter_evacuation(Thread* thr) { > 35: jint threads_in_evac = Atomic::load_acquire(&_threads_in_evac); This load seems to be redundant. I don't think it has any ordering effects and we will load it again immediately either below or in `register_thread()`. ------------- PR: https://git.openjdk.org/jdk/pull/10573 From rkennke at openjdk.org Thu Oct 6 07:47:02 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:02 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking Message-ID: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. 
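In pseudo-code, the fast-locking fast paths described in the following paragraphs look roughly like the sketch below (illustrative only; the lock-stack accessor, the markWord helpers for the fast-locked bit pattern and the slow-path name are assumptions, not the actual patch):

    // Fast lock: flip the two low mark bits from 'unlocked' (01) to
    // 'fast-locked' (00) and remember the oop on a small per-thread array.
    bool fast_lock(JavaThread* current, oop obj) {
      markWord mark = obj->mark();
      if (mark.is_unlocked()) {
        markWord locked = mark.set_fast_locked();     // low bits 00, rest untouched
        if (obj->cas_set_mark(locked, mark) == mark) {
          current->lock_stack().push(obj);            // records "current owns obj"
          return true;
        }
      }
      return false;  // contended or already locked -> inflate to an ObjectMonitor
    }

    // Fast unlock: restore the 'unlocked' bits and pop the oop. If a
    // contender inflated the lock in the meantime, fall through to the
    // monitor path, which fixes up the ANONYMOUS_OWNER marker to 'current'.
    void fast_unlock(JavaThread* current, oop obj) {
      markWord mark = obj->mark();
      if (mark.is_fast_locked() && current->lock_stack().contains(obj)) {
        if (obj->cas_set_mark(mark.set_unlocked(), mark) == mark) {
          current->lock_stack().remove(obj);
          return;
        }
      }
      exit_inflated_monitor(current, obj);  // hypothetical slow path
    }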
What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.

This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.

In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking is attempted, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.

One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing the lock over to the contending thread.

As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.

This change enables us to simplify (and speed up!) a lot of code:

- The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
- Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header.
This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR ### Benchmarks All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. #### DaCapo/AArch64 Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? benchmark | baseline | fast-locking | % | size -- | -- | -- | -- | -- avrora | 27859 | 27563 | 1.07% | large batik | 20786 | 20847 | -0.29% | large biojava | 27421 | 27334 | 0.32% | default eclipse | 59918 | 60522 | -1.00% | large fop | 3670 | 3678 | -0.22% | default graphchi | 2088 | 2060 | 1.36% | default h2 | 297391 | 291292 | 2.09% | huge jme | 8762 | 8877 | -1.30% | default jython | 18938 | 18878 | 0.32% | default luindex | 1339 | 1325 | 1.06% | default lusearch | 918 | 936 | -1.92% | default pmd | 58291 | 58423 | -0.23% | large sunflow | 32617 | 24961 | 30.67% | large tomcat | 25481 | 25992 | -1.97% | large tradebeans | 314640 | 311706 | 0.94% | huge tradesoap | 107473 | 110246 | -2.52% | huge xalan | 6047 | 5882 | 2.81% | default zxing | 970 | 926 | 4.75% | default #### DaCapo/x86_64 The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. benchmark | baseline | fast-Locking | % | size -- | -- | -- | -- | -- avrora | 127690 | 126749 | 0.74% | large batik | 12736 | 12641 | 0.75% | large biojava | 15423 | 15404 | 0.12% | default eclipse | 41174 | 41498 | -0.78% | large fop | 2184 | 2172 | 0.55% | default graphchi | 1579 | 1560 | 1.22% | default h2 | 227614 | 230040 | -1.05% | huge jme | 8591 | 8398 | 2.30% | default jython | 13473 | 13356 | 0.88% | default luindex | 824 | 813 | 1.35% | default lusearch | 962 | 968 | -0.62% | default pmd | 40827 | 39654 | 2.96% | large sunflow | 53362 | 43475 | 22.74% | large tomcat | 27549 | 28029 | -1.71% | large tradebeans | 190757 | 190994 | -0.12% | huge tradesoap | 68099 | 67934 | 0.24% | huge xalan | 7969 | 8178 | -2.56% | default zxing | 1176 | 1148 | 2.44% | default #### Renaissance/AArch64 This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 2558.832 | 2513.594 | 1.80% Reactors | 14715.626 | 14311.246 | 2.83% Als | 1851.485 | 1869.622 | -0.97% ChiSquare | 1007.788 | 1003.165 | 0.46% GaussMix | 1157.491 | 1149.969 | 0.65% LogRegression | 717.772 | 733.576 | -2.15% MovieLens | 7916.181 | 8002.226 | -1.08% NaiveBayes | 395.296 | 386.611 | 2.25% PageRank | 4294.939 | 4346.333 | -1.18% FjKmeans | 519.2 | 498.357 | 4.18% FutureGenetic | 2578.504 | 2589.255 | -0.42% Mnemonics | 4898.886 | 4903.689 | -0.10% ParMnemonics | 4260.507 | 4210.121 | 1.20% Scrabble | 139.37 | 138.312 | 0.76% RxScrabble | 320.114 | 322.651 | -0.79% Dotty | 1056.543 | 1068.492 | -1.12% ScalaDoku | 3443.117 | 3449.477 | -0.18% ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% FinagleChirper | 6814.192 | 6853.38 | -0.57% FinagleHttp | 4762.902 | 4807.564 | -0.93% #### Renaissance/x86_64 benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 1117.185 | 1116.425 | 0.07% Reactors | 11561.354 | 11812.499 | -2.13% Als | 1580.838 | 1575.318 | 0.35% ChiSquare | 459.601 | 467.109 | -1.61% GaussMix | 705.944 | 685.595 | 2.97% LogRegression | 659.944 | 656.428 | 0.54% MovieLens | 7434.303 | 7592.271 | -2.08% NaiveBayes | 413.482 | 417.369 | -0.93% PageRank | 3259.233 | 3276.589 | -0.53% FjKmeans | 946.429 | 938.991 | 0.79% FutureGenetic | 1760.672 | 1815.272 | -3.01% Scrabble | 147.996 | 150.084 | -1.39% RxScrabble | 177.755 | 177.956 | -0.11% Dotty | 673.754 | 683.919 | -1.49% ScalaKmeans | 165.376 | 168.925 | -2.10% ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. ### Testing - [x] tier1 (x86_64, aarch64, x86_32) - [x] tier2 (x86_64, aarch64) - [x] tier3 (x86_64, aarch64) - [x] tier4 (x86_64, aarch64) ------------- Commit messages: - Merge tag 'jdk-20+17' into fast-locking - Fix OSR packing in AArch64, part 2 - Fix OSR packing in AArch64 - Merge remote-tracking branch 'upstream/master' into fast-locking - Fix register in interpreter unlock x86_32 - Support unstructured locking in interpreter (x86 parts) - Support unstructured locking in interpreter (aarch64 and shared parts) - Merge branch 'master' into fast-locking - Merge branch 'master' into fast-locking - Added test for hand-over-hand locking - ... 
and 17 more: https://git.openjdk.org/jdk/compare/79ccc791...3ed51053 Changes: https://git.openjdk.org/jdk/pull/9680/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=9680&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291555 Stats: 3660 lines in 127 files changed: 650 ins; 2481 del; 529 mod Patch: https://git.openjdk.org/jdk/pull/9680.diff Fetch: git fetch https://git.openjdk.org/jdk pull/9680/head:pull/9680 PR: https://git.openjdk.org/jdk/pull/9680 From stuefe at openjdk.org Thu Oct 6 07:47:02 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:47:02 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. 
When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) When I run renaissance philosophers benchmark (no arguments, just the default settings) on my 12 core machine the VM intermittently hangs after the benchmark is done. Always, two threads keep running at 100% CPU. 
I have been able to attach gdb once and we were in a tight loop in (gdb) bt #0 Atomic::PlatformLoad<8ul>::operator() (dest=0x7f991c119e80, this=) at src/hotspot/share/runtime/atomic.hpp:614 #1 Atomic::LoadImpl, void>::operator() (dest=0x7f991c119e80, this=) at src/hotspot/share/runtime/atomic.hpp:392 #2 Atomic::load (dest=0x7f991c119e80) at src/hotspot/share/runtime/atomic.hpp:615 #3 ObjectMonitor::owner_raw (this=0x7f991c119e40) at src/hotspot/share/runtime/objectMonitor.inline.hpp:66 #4 ObjectMonitor::owner (this=0x7f991c119e40) at src/hotspot/share/runtime/objectMonitor.inline.hpp:61 #5 ObjectSynchronizer::monitors_iterate (thread=0x7f9a30027230, closure=) at src/hotspot/share/runtime/synchronizer.cpp:983 #6 ObjectSynchronizer::release_monitors_owned_by_thread (current=current at entry=0x7f9a30027230) at src/hotspot/share/runtime/synchronizer.cpp:1492 #7 0x00007f9a351bc320 in JavaThread::exit (this=this at entry=0x7f9a30027230, destroy_vm=destroy_vm at entry=false, exit_type=exit_type at entry=JavaThread::jni_detach) at src/hotspot/share/runtime/javaThread.cpp:851 #8 0x00007f9a352445ca in jni_DetachCurrentThread (vm=) at src/hotspot/share/prims/jni.cpp:3962 #9 0x00007f9a35f9ac7e in JavaMain (_args=) at src/java.base/share/native/libjli/java.c:555 #10 0x00007f9a35f9e30d in ThreadJavaMain (args=) at src/java.base/unix/native/libjli/java_md.c:650 #11 0x00007f9a35d47609 in start_thread (arg=) at pthread_create.c:477 #12 0x00007f9a35ea3133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 in one thread. Which points to a misformed monitor list. I tried to reproduce it with a debug build, but no such luck. I was able to reproduce it once again with a release build. I'll see if I can find out more. Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. Problem here is the OM in use list is circular (and very big, ca 11mio entries). I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3). Offlist questions from Roman: -"Does it really not happen with Stock?" no, I could not reproduce it with stock VM (built from f5d1b5bda27c798347ae278cbf69725ed4be895c, the commit preceding the PR) -"Do we now have more OMs than before?" I cannot see that effect. Running philosophers with -r 3 causes the VM in the end to have between 800k and ~2mio open OMs *if the error does not happen*, no difference between stock and PR VM. In cases where the PR-VM hangs we have a lot more, as I wrote, about 11-12mio OMs. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:03 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:03 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Wed, 3 Aug 2022 07:17:51 GMT, Thomas Stuefe wrote: > Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. > > Problem here is the OM in use list is circular (and very big, ca 11mio entries). 
> > I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3). Hi Thomas, thanks for testing and reporting the issue. I just pushed an improvement (and simplification) of the monitor-enter-inflate path, and cannot seem to reproduce the problem anymore. Can you please try again with the latest change? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From stuefe at openjdk.org Thu Oct 6 07:47:04 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:47:04 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: <87vYXa_Uu88sb8ldFeGdHfeqPCMPxhhzqbVOooXle7A=.09d21ecc-8910-464d-b164-88b8322ebd34@github.com> On Sun, 7 Aug 2022 12:50:01 GMT, Roman Kennke wrote: > > Happens when the main thread detaches itself upon VM exit. VM attempts to release OMs that are owned by the finished main thread (side note: if that is the sole surviving thread, maybe that step could be skipped?). That happens before DestroyVM, so OM final audit did not yet run. > > Problem here is the OM in use list is circular (and very big, ca 11mio entries). > > I was able to reproduce it with a fastdebug build in 1 out of 5-6 runs. Also with less benchmark cycles (-r 3). > > Hi Thomas, thanks for testing and reporting the issue. I just pushed an improvement (and simplification) of the monitor-enter-inflate path, and cannot seem to reproduce the problem anymore. Can you please try again with the latest change? New version ran for 30 mins without crashing. Not a solid proof, but its better :-) ------------- PR: https://git.openjdk.org/jdk/pull/9680 From dholmes at openjdk.org Thu Oct 6 07:47:05 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 6 Oct 2022 07:47:05 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). 
Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. 
Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially. I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things. Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit. 
What "modern workloads" are you using to measure this? We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks. Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons? I don't have github notification enabled so I missed this discussion. The JVMS permits lock A, lock B, unlock A, unlock B, in bytecode - i.e it passes verification and it does not violate the structured locking rules. It probably also passes verification if there is no exception table entries such that the unlocks are guaranteed to happen - regardless of the order. IIUC from above the VM will actually unlock all monitors for which there is a lock-record in the activation when the activation returns. The order in which it does that may be different to how the program would have done it but I don't see how that makes any difference to anything. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From stuefe at openjdk.org Thu Oct 6 07:47:07 2022 From: stuefe at openjdk.org (Thomas Stuefe) Date: Thu, 6 Oct 2022 07:47:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> Message-ID: On Mon, 8 Aug 2022 13:45:06 GMT, Roman Kennke wrote: > The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially. I fully agree that have to be careful, but I share Roman's viewpoint. If this work is something we want to happen and which is not in doubt in principle, then we also want the broadest possible test front. In my experience, opt-in coding is tested poorly. A runtime switch is fine as an emergency measure when you have customer problems, but then both standard and fallback code paths need to be very well tested. With something as ubiquitous as locking this would mean running almost the full test set with and without the new fast locking mechanism, and that is not feasible. Or even if it is, not practical: the cycles are better invested in hardening out the new locking mechanism. And arguably, we already have an opt-out mechanism in the form of UseHeavyMonitors. It's not ideal, but as Roman wrote, in most scenarios, this does not show any regression. So in a pinch, it could serve as a short-term solution if the new fast lock mechanism is broken. In my opinion, the best time for such an invasive change is the beginning of the development cycle for a non-LTS-release, like now. And we don't have to push the PR in a rush, we can cook it in its branch and review it very thoroughly. 
Cheers, Thomas ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:07 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. 
All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. 
> > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. 
> > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: #7 0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this at entry=0x7f7ce801dd90, adr=adr at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 #8 0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 #9 0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=, owner=owner at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. That explains why I have not heard anything about popframe and friends :) ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:09 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:09 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 8 Aug 2022 18:29:54 GMT, Roman Kennke wrote: > > I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: > > ``` > > #7 0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this at entry=0x7f7ce801dd90, adr=adr at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 > > #8 0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 > > #9 0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=, owner=owner at entry=0x1 ) > > at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 > > ``` > > Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an anonymous owner must be converted to the oop address before passing down to Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? Testing com/sun/jdi passes for me, now. Yes, that fixed it. I'm running more tests also. I got this build problem on aarch64: open/src/hotspot/share/asm/assembler.hpp:168), pid=3387376, tid=3387431 # assert(is_bound() || is_unused()) failed: Label was never bound to a location, but it was used as a jmp target V [libjvm.so+0x4f4788] Label::~Label()+0x48 V [libjvm.so+0x424a44] cmpFastLockNode::emit(CodeBuffer&, PhaseRegAlloc*) const+0x764 V [libjvm.so+0x1643888] PhaseOutput::fill_buffer(CodeBuffer*, unsigned int*)+0x538 V [libjvm.so+0xa85fcc] Compile::Code_Gen()+0x3bc ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:10 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:10 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Tue, 9 Aug 2022 09:19:54 GMT, Robbin Ehn wrote: > I got this build problem on aarch64: Thanks for giving this PR a spin. 
I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:06 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:06 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> On Mon, 8 Aug 2022 12:14:38 GMT, David Holmes wrote: > The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially. Reverting a change should not be difficult. (Unless maybe another major change arrived in the meantime, which makes reverse-applying a patch non-trivial.) I'm skeptical to implement an opt-in runtime-switch, though. - Keeping the old paths side-by-side with the new paths is an engineering effort in itself, as you point out. It means that it, too, introduces significant risks to break locking, one way or the other (or both). - Making the new path opt-in means that we achieve almost nothing by it: testing code would still normally run the old paths (hopefully we didn't break it by making the change), and only use the new paths when explicitely told so, and I don't expect that many people voluntarily do that. It *may* be more useful to make it opt-out, as a quick fix if anybody experiences troubles with it. - Do we need runtime-switchable opt-in or opt-out flag for the initial testing and baking? I wouldn't think so: it seems better and cleaner to take the Git branch of this PR and put it through all relevant testing before the change goes in. - For how long do you think the runtime switch should stay? Because if it's all but temporary, it means we better test both paths thoroughly and automated. And it may also mean extra maintenance work (with extra avenues for bugs, see above), too. > I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things. I believe the least risky path overall is to make UseHeavyMonitors a production flag. Then it can act as a kill-switch for the new locking code, should anything go bad. I even considered to remove stack-locking altogether, and could only show minor performance impact, and always only in code that uses obsolete synchronized Java collections like Vector, Stack and StringBuffer. 
If you'd argue that it's too risky to use UseHeavyMonitors for that - then certainly you understand that the risk of introducing a new flag and managing two stack-locking subsystems would be even higher. There's a lot of code that is risky in itself to keep both paths. For example, I needed to change register allocation in the C2 .ad declarations and also in the interpreter/generated assembly code. It's hard enough to see that it is correct for one of the implementations, and much harder to implement and verify this correctly for two. > Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit. Not only that. As far as I can tell, 'heavy monitors' would only be worse off in workloads that 1. use uncontended sync and 2. churn monitors. Lots of uncontended sync on the same monitor object is not actually worse than fast-locking (it boils down to a single CAS in both cases). It only gets bad when code keeps allocating short-lived objects and syncs on them once or a few times only, and then moves on to the next new sync objects. > What "modern workloads" are you using to measure this? So far I tested with SPECjbb and SPECjvm-workloads-transplanted-into-JMH, dacapo and renaissance. I could only measure regressions with heavy monitors in workloads that use XML/XSLT, which I found out is because the XSLT compiler generates code that uses StringBuffer for (single-threaded) parsing. I also found a few other places in XML where usage of Stack and Vector has some impact. I can provide fixes for those, if needed (but I'm not sure whether this should go into JDK, upstream Xalan/Xerces or both). > We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks. Yup. > Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons? Yes, except that the flag would have to be made product. Also, it is useful to use this PR instead of upstream JDK, because it simplifies the inflation protocol pretty much like it would be simplified without any stack-locking. I can make a standalone PR that gets rid of stack-locking altogether, if that is useful. Also keep in mind that both this fast-locking PR and total removal of stack-locking would enable some follow-up improvements: we'd no longer have to inflate monitors in order to install or read an i-hashcode. And GC code similarly may benefit from easier read/write of object age bits. This might benefit generational concurrent GC efforts.
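As an aside for readers following this thread, a minimal, hypothetical sketch of the kind of code being described here - a single-threaded workload that allocates a short-lived synchronized object, locks it a few times and drops it - might look like the following. This is an illustration only, not code taken from the PR or from Xalan/Xerces:

```java
// Illustrative only: each iteration allocates a fresh synchronized object
// (StringBuffer), locks it a handful of times via its synchronized append()
// methods, and then drops it.
public class MonitorChurnExample {
    public static void main(String[] args) {
        long blackhole = 0;
        for (int i = 0; i < 1_000_000; i++) {
            // New lock object every iteration -> lots of single-use, single-threaded locking.
            StringBuffer sb = new StringBuffer();
            sb.append(i).append(',').append(i * 2);   // each call locks sb, uncontended
            blackhole += sb.length();
        }
        System.out.println(blackhole);
    }
}
```

With stack- or fast-locking each append is a cheap uncontended lock; with heavy monitors only, every such single-use object can end up with its own ObjectMonitor, which is the churn effect described above.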
------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:08 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 8 Aug 2022 15:44:50 GMT, Robbin Ehn wrote: > I ran some test locally, 4 JDI fails and 3 JVM TI, all seems to fail in: > > ``` > #7 0x00007f7cefc5c1ce in Thread::is_lock_owned (this=this at entry=0x7f7ce801dd90, adr=adr at entry=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/thread.cpp:549 > #8 0x00007f7cef22c062 in JavaThread::is_lock_owned (this=0x7f7ce801dd90, adr=0x1 ) at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/javaThread.cpp:979 > #9 0x00007f7cefc79ab0 in Threads::owning_thread_from_monitor_owner (t_list=, owner=owner at entry=0x1 ) > at /home/rehn/source/jdk/ongit/dev-jdk/open/src/hotspot/share/runtime/threads.cpp:1382 > ``` Thanks, Robbin! That was a bug in JvmtiBase::get_owning_thread() where an anonymous owner must be converted to the oop address before passing down to Threads::owning_thread_from_monitor_owner(). I pushed a fix. Can you re-test? Testing com/sun/jdi passes for me, now. > I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. > > That explains why I have not heard anything about popframe and friends :) Hmm yeah, I also realized this recently :-D I will have to clean this up before going further. And I'll also will work to support the unstructured locking in the interpreter. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:11 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:11 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Tue, 9 Aug 2022 10:46:51 GMT, Roman Kennke wrote: > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:12 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:12 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <8MGsPdlBSWGR-pgF8_fLo_mez67z7nHWXg8UOcjJxIY=.38bd9c0f-3ba0-4ebe-867d-b54608f01e63@github.com> Message-ID: <9P1YMHdrh0hBVSsynwUQ5PVpU14yaF5V-00H5uWGLek=.fc7c13d9-3601-4f2e-8846-0b66eb0a13df@github.com> On Tue, 9 Aug 2022 09:32:47 GMT, Roman Kennke wrote: > I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. 
Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things. Maybe it's worth to step back a little and discuss whether or not we actually want stack-locking (or a replacement) *at all*. My measurements seem to indicate that a majority of modern workloads (i.e. properly synchronized, not using legacy collections) actually benefit from running without stack-locking (or the fast-locking replacement). The workloads that suffer seem to be only such workloads which make heavy use of always-synchronized collections, code that we'd nowadays probably not consider 'idiomatic Java' anymore. This means that support for faster legacy code costs modern Java code actual performance points. Do we really want this? It may be wiser overall to simply drop stack-locking without replacement, and go and fix the identified locations where using of legacy collections affects performance negatively in the JDK (I found a few places in XML/XSLT code, for example). I am currently re-running my benchmarks to show this. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:13 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Tue, 9 Aug 2022 11:05:45 GMT, Robbin Ehn wrote: > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). > > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". Do you know by any chance which tests trigger this? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:13 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> On Thu, 11 Aug 2022 11:19:31 GMT, Roman Kennke wrote: > > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). > > > > > > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". > > Do you know by any chance which tests trigger this? 
Yes, there is a couple of to choose from, I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:14 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. 
In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. 
> > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) I added implementation for arm, ppc and s390 blindly. @shipilev, @tstuefe maybe you could sanity-check them? most likely they are buggy. I also haven't checked riscv at all, yet. 
------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:15 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:15 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Thu, 11 Aug 2022 11:39:01 GMT, Robbin Ehn wrote: > > > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). > > > > > > > > > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". > > > > > > Do you know by any chance which tests trigger this? > > Yes, there is a couple of to choose from, I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java I pushed a refactoring and fixes to the relevant code, and all users should now work correctly. It's passing test tiers1-3 and tier4 is running while I write this. @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway). See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:15 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:15 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Thu, 11 Aug 2022 11:39:01 GMT, Robbin Ehn wrote: >>> > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). >>> >>> NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". >> >> Do you know by any chance which tests trigger this? > >> > > Thanks for giving this PR a spin. I pushed a fix for the aarch64 build problem (seems weird that GHA did not catch it). >> > >> > >> > NP, thanks. I notice some other user of owning_thread_from_monitor_owner() such as DeadlockCycle::print_on_with() which asserts on "assert(adr != reinterpret_cast(1)) failed: must convert to lock object". >> >> Do you know by any chance which tests trigger this? 
> > Yes, there is a couple of to choose from, I think the jstack cmd may be easiest: jstack/DeadlockDetectionTest.java > @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway). > > See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm jvms-2.11.10 > Structured locking is the situation when, during a method invocation, every exit on a given monitor matches a preceding entry on that monitor. Since there is no assurance that all code submitted to the Java Virtual Machine will perform structured locking, implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking. Let T be a thread and M be a monitor. Then: > > The number of monitor entries performed by T on M during a method invocation must equal the number of monitor exits performed by T on M during the method invocation whether the method invocation completes normally or abruptly. > > At no point during a method invocation may the number of monitor exits performed by T on M since the method invocation exceed the number of monitor entries performed by T on M since the method invocation. > > Note that the monitor entry and exit automatically performed by the Java Virtual Machine when invoking a synchronized method are considered to occur during the calling method's invocation. I think the intent of above was to allow enforcing structured locking. In relevant other projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. In that project JNI monitor enter/exit do not use the lockstack. I don't think we today fully support unstructured locking either: void foo_lock() { monitorenter(this); // If VM abruptly returns here 'this' will be unlocked // Because VM assumes structured locking. // see e.g. remove_activation(...) } *I scratch this as it was a bit off topic.* ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:16 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:16 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Tue, 16 Aug 2022 15:47:58 GMT, Robbin Ehn wrote: > > @robehn or @dholmes-ora I believe one of you mentioned somewhere (can't find the comment, though) that we might need to support the bytecode sequence monitorenter A; monitorenter B; monitorexit A; monitorexit B; properly. I have now made a testcase that checks this, and it does indeed fail with this PR, while passing with upstream. Also, the JVM spec doesn't mention anywhere that it is required that monitorenter/exit are properly nested. I'll have to fix this in the interpreter (JIT compilers refuse to compile not-properly-nested monitorenter/exit anyway). 
> > See https://github.com/rkennke/jdk/blob/fast-locking/test/hotspot/jtreg/runtime/locking/TestUnstructuredLocking.jasm > > jvms-2.11.10 > > > Structured locking is the situation when, during a method invocation, every exit on a given monitor matches a preceding entry on that monitor. Since there is no assurance that all code submitted to the Java Virtual Machine will perform structured locking, implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking. Let T be a thread and M be a monitor. Then: > > The number of monitor entries performed by T on M during a method invocation must equal the number of monitor exits performed by T on M during the method invocation whether the method invocation completes normally or abruptly. > > At no point during a method invocation may the number of monitor exits performed by T on M since the method invocation exceed the number of monitor entries performed by T on M since the method invocation. > > Note that the monitor entry and exit automatically performed by the Java Virtual Machine when invoking a synchronized method are considered to occur during the calling method's invocation. > > I think the intent of above was to allow enforcing structured locking. TBH, I don't see how this affects the scenario that I'm testing. The scenario: monitorenter A; monitorenter B; monitorexit A; monitorexit B; violates any of the two conditions: - the number of monitorenters and -exits during the execution always matches - the number of monitorexits for each monitor does not exceed the number of monitorenters for the same monitor Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. > In relevant other projects, we support only structured locking in Java, but permit some unstructured locking when done via JNI. In that project JNI monitor enter/exit do not use the lockstack. Yeah, JNI locking always inflate and uses full monitors. My proposal hasn't changed this. > I don't think we today fully support unstructured locking either: > > ``` > void foo_lock() { > monitorenter(this); > // If VM abruptly returns here 'this' will be unlocked > // Because VM assumes structured locking. > // see e.g. remove_activation(...) > } > ``` > > _I scratch this as it was a bit off topic._ Hmm yeah, this is required for properly handling exceptions. I have seen this making a bit of a mess in C1 code. That said, unstructured locking today only ever works in the interpreter, the JIT compilers would refuse to compile unstructured locking code. So if somebody would come up with a language and compiler that emits unstructured (e.g. hand-over-hand) locks, it would run, but only very slowly. I think I know how to make my proposal handle unstructured locking properly: In the interpreter monitorexit, I can check the top of the lock-stack, and if it doesn't match, call into the runtime, and there it's easy to implement the unstructured scenario. 
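To make the proposed interpreter handling concrete, here is a rough, illustrative Java model of a per-thread lock-stack with a fast top-of-stack check on exit and a slow path for unstructured exits. The class and method names are invented for this sketch; this is not HotSpot code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a per-thread lock stack: the common, structured case only
// touches the top of the stack; an unstructured exit (object not on top)
// takes a slow path that searches the stack and removes the entry.
final class LockStackModel {
    private final List<Object> stack = new ArrayList<>();

    void enter(Object lockee) {
        stack.add(lockee);                    // push on monitorenter
    }

    void exit(Object lockee) {
        int top = stack.size() - 1;
        if (top >= 0 && stack.get(top) == lockee) {
            stack.remove(top);                // fast path: properly nested exit
        } else {
            slowPathExit(lockee);             // unstructured exit: runtime call in the real VM
        }
    }

    private void slowPathExit(Object lockee) {
        // Search from the top and remove the first matching entry.
        for (int i = stack.size() - 1; i >= 0; i--) {
            if (stack.get(i) == lockee) {
                stack.remove(i);
                return;
            }
        }
        throw new IllegalMonitorStateException("not locked by this thread");
    }
}
```

In the real VM the fast path would stay in the interpreter's generated code, and only the mismatch case would call into the runtime, as suggested above.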
------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:17 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:17 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Tue, 16 Aug 2022 16:21:04 GMT, Roman Kennke wrote: > Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. I know but the text says: - "every exit on a given monitor matches a preceding entry on that monitor." - "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" I read this as if the rules do not guarantee structured locking the rules are not correct. The VM is allowed to enforce it. But thats just my take on it. EDIT: Maybe I'm reading to much into it. Lock A,B then unlock A,B maybe is considered structured locking? But then again what if: void foo_lock() { monitorenter(A); monitorenter(B); // If VM abruptly returns here // VM can unlock them in reverse order first B and then A ? monitorexit(A); monitorexit(B); } ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:17 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:17 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Wed, 17 Aug 2022 07:29:23 GMT, Robbin Ehn wrote: > > Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. > > I know but the text says: > > * "every exit on a given monitor matches a preceding entry on that monitor." > > * "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" > > > I read this as if the rules do not guarantee structured locking the rules are not correct. The VM is allowed to enforce it. But thats just my take on it. > > EDIT: Maybe I'm reading to much into it. Lock A,B then unlock A,B maybe is considered structured locking? > > But then again what if: > > ``` > void foo_lock() { > monitorenter(A); > monitorenter(B); > // If VM abruptly returns here > // VM can unlock them in reverse order first B and then A ? > monitorexit(A); > monitorexit(B); > } > ``` Do you think there would be any chance to clarify the spec there? Or even outright disallow unstructured/not-properly-nested locking altogether (and maybe allow the verifier to check it)? That would certainly be the right thing to do. And, afaict, it would do no harm because no compiler of any language would ever emit unstructured locking anyway - because if it did, the resulting code would crawl interpreted-only). 
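For context on what source compilers actually emit: a small hedged example of nested synchronized blocks, with comments paraphrasing the bytecode shape javac produces (properly nested and balanced, released in reverse order, including on abrupt exit). The A-B-A-B ordering debated above cannot be expressed in Java source:

```java
// javac only ever produces properly nested, balanced monitorenter/monitorexit
// pairs for synchronized blocks. Roughly, the method below compiles to
//   monitorenter a; monitorenter b; monitorexit b; monitorexit a;
// plus exception handlers that release b and then a on any abrupt exit.
public class NestedSync {
    private final Object a = new Object();
    private final Object b = new Object();

    void nested() {
        synchronized (a) {        // monitorenter a
            synchronized (b) {    // monitorenter b
                // critical section protected by both locks
            }                     // monitorexit b (also in the exception handler)
        }                         // monitorexit a (also in the exception handler)
    }
}
```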
------------- PR: https://git.openjdk.org/jdk/pull/9680 From kvn at openjdk.org Thu Oct 6 07:47:18 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Thu, 6 Oct 2022 07:47:18 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> <4eWUSTg-0XMN2ON3FYCM5uAUeIlarRNXgPJquBCXTQs=.5272e1a1-dcbb-4360-bb6d-9c0bc9d35313@github.com> Message-ID: On Wed, 17 Aug 2022 15:34:01 GMT, Roman Kennke wrote: >>> Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. >> >> I know but the text says: >> - "every exit on a given monitor matches a preceding entry on that monitor." >> - "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" >> >> I read this as if the rules do not guarantee structured locking the rules are not correct. >> The VM is allowed to enforce it. >> But thats just my take on it. >> >> EDIT: >> Maybe I'm reading to much into it. >> Lock A,B then unlock A,B maybe is considered structured locking? >> >> But then again what if: >> >> >> void foo_lock() { >> monitorenter(A); >> monitorenter(B); >> // If VM abruptly returns here >> // VM can unlock them in reverse order first B and then A ? >> monitorexit(A); >> monitorexit(B); >> } > >> > Strictly speaking, I believe the conditions check for the (weaker) balanced property, but not for the (stronger) structured property. >> >> I know but the text says: >> >> * "every exit on a given monitor matches a preceding entry on that monitor." >> >> * "implementations of the Java Virtual Machine are permitted but not required to enforce both of the following two rules guaranteeing structured locking" >> >> >> I read this as if the rules do not guarantee structured locking the rules are not correct. The VM is allowed to enforce it. But thats just my take on it. >> >> EDIT: Maybe I'm reading to much into it. Lock A,B then unlock A,B maybe is considered structured locking? >> >> But then again what if: >> >> ``` >> void foo_lock() { >> monitorenter(A); >> monitorenter(B); >> // If VM abruptly returns here >> // VM can unlock them in reverse order first B and then A ? >> monitorexit(A); >> monitorexit(B); >> } >> ``` > > Do you think there would be any chance to clarify the spec there? Or even outright disallow unstructured/not-properly-nested locking altogether (and maybe allow the verifier to check it)? That would certainly be the right thing to do. And, afaict, it would do no harm because no compiler of any language would ever emit unstructured locking anyway - because if it did, the resulting code would crawl interpreted-only). We need to understand performance effects of these changes. I don't see data here or new JMH benchmarks which can show data. @rkennke can you show data you have? And, please, update RFE description with what you have in PR description. @ericcaspole do we have JMH benchmarks to test performance for different lock scenarios? I see few tests in `test/micro` which use `synchronized`. Are they enough? Or we need more? Do we have internal benchmarks we could use for such testing? I would prefer to have "opt-in" but looking on scope of changes it may introduce more issues. Without "opt-in" I want performance comparison of VMs with different implementation instead of using `UseHeavyMonitors` to make judgement about this implementation. 
`UseHeavyMonitors` (product flag) should be tested separately to make sure when it is used as fallback mechanism by customers they would not get significant performance penalty. I agree with @tstuefe that we should test this PR a lot (all tiers on all supported platforms) including performance testing before integration. In addition we need full testing of this implementation with `UseHeavyMonitors` ON. And I should repeat that integration happens when changes are ready (no issues). We should not rush for particular JDK release. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:19 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Tue, 30 Aug 2022 11:52:24 GMT, Roman Kennke wrote: >> I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) >> So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. >> >> That explains why I have not heard anything about popframe and friends :) > >> I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. >> >> That explains why I have not heard anything about popframe and friends :) > > Hmm yeah, I also realized this recently :-D > I will have to clean this up before going further. And I'll also will work to support the unstructured locking in the interpreter. > We need to understand performance effects of these changes. I don't see data here or new JMH benchmarks which can show data. @rkennke can you show data you have? And, please, update RFE description with what you have in PR description. I did run macro benchmarks (SPECjvm, SPECjbb, renaissance, dacapo) and there performance is most often <1% from baseline, some better, some worse. However, I noticed that I made a mistake in my benchmark setup, and I have to re-run them again. So far it doesn't look like the results will be much different - only more reliable. Before I do proper re-runs, I first want to work on removing the interpreter lock-stack, and also to support 'weird' locking (see discussion above). I don't expect those to affect performance very much, because it will only change the interpreter paths. I haven't run any microbenchmarks, yet, but it may be useful. If you have any, please point me in the direction. > I would prefer to have "opt-in" but looking on scope of changes it may introduce more issues. Without "opt-in" I want performance comparison of VMs with different implementation instead of using `UseHeavyMonitors` to make judgement about this implementation. `UseHeavyMonitors` (product flag) should be tested separately to make sure when it is used as fallback mechanism by customers they would not get significant performance penalty. Yes, I can do that. > I agree with @tstuefe that we should test this PR a lot (all tiers on all supported platforms) including performance testing before integration. In addition we need full testing of this implementation with `UseHeavyMonitors` ON. Ok. I'd also suggest to run relevant (i.e. what relates to synchronized) jcstress tests. 
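On the question of JMH coverage for different lock scenarios, a minimal, hypothetical sketch (not one of the existing test/micro benchmarks) that separates the two uncontended cases discussed in this thread could look like this:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Hypothetical JMH sketch: two single-threaded, uncontended synchronized
// scenarios. The first reuses one long-lived lock object, the second
// allocates a new lock object per invocation (the "churn" case).
@State(Scope.Thread)
public class UncontendedSyncBench {

    private final Object lock = new Object();
    private int counter;
    private Object sink;   // lets the fresh lock object escape

    @Benchmark
    public int reusedLock() {
        synchronized (lock) {      // same object locked over and over
            return ++counter;
        }
    }

    @Benchmark
    public int churnedLock() {
        Object fresh = new Object();
        sink = fresh;              // publish it so the lock cannot be elided
        synchronized (fresh) {     // new lock object every call
            return ++counter;
        }
    }
}
```

The churn case deliberately lets the lock object escape so that lock elision cannot simply remove the synchronization.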
> And I should repeat that integration happens when changes are ready (no issues). We should not rush for particular JDK release. Sure, I am not planning on rushing this. ;-) > I didn't realize you still also is using the frame basic lock area. (in other projects this is removed and all cases are handled via the threads lock stack) So essentially we have two lock stacks when running in interpreter the frame area and the LockStack. > > That explains why I have not heard anything about popframe and friends :) I'm now wondering if what I kinda accidentally did there is not the sane thing to do. The 'real' lock-stack (the one that I added) holds all the (fast-)locked oops. The frame basic lock area also holds oops now (before it was oop-lock pairs), and in addition to the per-thread lock-stack it also holds the association frame->locks, which is useful when popping interpreter frames, so that we can exit all active locks easily. C1 and C2 don't need this, because 1. the monitor enter and exit there is always symmetric and 2. they have their own and more efficient ways to remove activations. How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:20 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:20 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Fri, 9 Sep 2022 19:01:14 GMT, Roman Kennke wrote: > How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. private static final void monitorEnter(Object o) { .... long monitorFrameId = getCallerFrameId(); ``` When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); ` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 07:47:21 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 07:47:21 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 12 Sep 2022 06:37:19 GMT, Robbin Ehn wrote: > > How have you handled the interpreter lock-stack-area in your implementation? 
Is it worth to get rid of it and consolidate with the per-thread lock-stack? > > At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. > > ``` > private static final void monitorEnter(Object o) { > .... > long monitorFrameId = getCallerFrameId(); > ``` > > When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) Hmm ok. I was thinking something similar, but instead of storing pairs (oop/frame-id), push frame-markers on the lock-stack. But given that we only need all this for the interpreter, I am wondering if keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) is the saner thing to do. The overhead seems very small, perhaps very similar to keeping track of frames in the per-thread lock-stack. > But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. > > So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. > Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. Nice! >From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? Thanks, Roman ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 07:47:22 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 07:47:22 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: > Nice! From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? > Yes, the entire implementation is in Java. void push(Object lockee, long fid) { if (this != Thread.currentThread()) Monitor.abort("invariant"); if (lockStackPos == lockStack.length) { grow(); } frameId[lockStackPos] = fid; lockStack[lockStackPos++] = lockee; } We are starting from the point of let's do everything be in Java. I want smart people to being able to change the implementation. So I really don't like the hardcoded assembly in remove_activation which do this check on frame id on the lock stack. If we can make the changes to e.g. popframe and take a bit different approach to JVMS we may have a total flexible Java implementation. But a flexible Java implementation means compiler can't have intrinsics, so what will the performance be.... We have more loose-ends than we can handle at the moment. 
Your code may be useable for JOM if we lock the implementation to using a lock-stack and we are going to write intrinsics to it. There is no point of it being in Java if so IMHO. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rehn at openjdk.org Thu Oct 6 08:13:09 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 6 Oct 2022 08:13:09 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: >>> How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. >> The frame id is previous fp, grabbed from "link()" when entering the locking code. >> >> private static final void monitorEnter(Object o) { >> .... >> long monitorFrameId = getCallerFrameId(); >> ``` >> When popping we can thus check if there is still monitors/locks for the frame to be popped. >> Remove activation reads the lock stack, with a bunch of assembly, e.g.: >> ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); >> ` >> If we would keep this, loom freezing would need to relativize and derelativize these values. >> (we only have interpreter) >> >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. >> This code that unlocks all locks in the frame seems to have been added for JLS 17.1. >> I have asked for clarification and we only need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. >> >> Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. > >> > How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. >> >> ``` >> private static final void monitorEnter(Object o) { >> .... >> long monitorFrameId = getCallerFrameId(); >> ``` >> >> When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) > > Hmm ok. I was thinking something similar, but instead of storing pairs (oop/frame-id), push frame-markers on the lock-stack. > > But given that we only need all this for the interpreter, I am wondering if keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) is the saner thing to do. The overhead seems very small, perhaps very similar to keeping track of frames in the per-thread lock-stack. > >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. 
>> >> So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. > >> Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. > > Nice! > From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? > > Thanks, > Roman @rkennke I will have a look, but may I suggest to open a new PR and just reference this as background discussion? I think most of the comments above is not relevant enough for a new reviewer to struggle through. What do you think? ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 09:39:31 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 09:39:31 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Mon, 12 Sep 2022 07:54:48 GMT, Roman Kennke wrote: >>> How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. >> The frame id is previous fp, grabbed from "link()" when entering the locking code. >> >> private static final void monitorEnter(Object o) { >> .... >> long monitorFrameId = getCallerFrameId(); >> ``` >> When popping we can thus check if there is still monitors/locks for the frame to be popped. >> Remove activation reads the lock stack, with a bunch of assembly, e.g.: >> ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg); >> ` >> If we would keep this, loom freezing would need to relativize and derelativize these values. >> (we only have interpreter) >> >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. >> This code that unlocks all locks in the frame seems to have been added for JLS 17.1. >> I have asked for clarification and we only need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. >> >> Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. > >> > How have you handled the interpreter lock-stack-area in your implementation? Is it worth to get rid of it and consolidate with the per-thread lock-stack? >> >> At the moment I had to store a "frame id" for each entry in the lock stack. The frame id is previous fp, grabbed from "link()" when entering the locking code. >> >> ``` >> private static final void monitorEnter(Object o) { >> .... >> long monitorFrameId = getCallerFrameId(); >> ``` >> >> When popping we can thus check if there is still monitors/locks for the frame to be popped. Remove activation reads the lock stack, with a bunch of assembly, e.g.: ` access_load_at(T_INT, IN_HEAP, rax, Address(rax, java_lang_Thread::lock_stack_pos_offset()), noreg, noreg);` If we would keep this, loom freezing would need to relativize and derelativize these values. (we only have interpreter) > > Hmm ok. 
I was thinking something similar, but instead of storing pairs (oop/frame-id), push frame-markers on the lock-stack. > > But given that we only need all this for the interpreter, I am wondering if keeping what we have now (e.g. the per-frame-lock-stack in interpreter frame) is the saner thing to do. The overhead seems very small, perhaps very similar to keeping track of frames in the per-thread lock-stack. > >> But, according to JVMS 2.11.10. the VM only needs to automatically unlock synchronized method. This code that unlocks all locks in the frame seems to have been added for JLS 17.1. I have asked for clarification and we only need and should care about JVMS. >> >> So if we could make popframe do more work (popframe needs to unlock all), there seems to be way forward allowing more flexibility. > >> Still working on trying to make what we have public, even if it's in roughly shape and it's very unclear if that is the correct approach at all. > > Nice! > From your snippets above I am gleaning that your implementation has the actual lock-stack in Java. Is that correct? Is there a particular reason why you need this? Is this for Loom? Would the implementation that I am proposing here also work for your use-case(s)? > > Thanks, > Roman > @rkennke I will have a look, but may I suggest to open a new PR and just reference this as background discussion? I think most of the comments above is not relevant enough for a new reviewer to struggle through. What do you think? Ok, will do that. Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 10:22:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:22:14 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' 
is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. 
The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Closing this PR in favour of a new, clean PR. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 10:22:14 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:22:14 GMT Subject: Withdrawn: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. 
> > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. 
This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) This pull request has been closed without being integrated. ------------- PR: https://git.openjdk.org/jdk/pull/9680 From rkennke at openjdk.org Thu Oct 6 10:30:19 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 6 Oct 2022 10:30:19 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking Message-ID: This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. 
This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. This change enables to simplify (and speed-up!) a lot of code: - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR ### Benchmarks All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. 
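Before the numbers, to make the scheme above concrete: a deliberately simplified, hypothetical sketch - made-up names, no monitor inflation, no recursion support, no GC interaction; this is not the code in this PR:

```
// Hypothetical sketch of the fast-locking fast path described above.
#include <atomic>
#include <cstdint>

struct Obj { std::atomic<uintptr_t> header; };

struct LockStack {                    // small per-thread array of owned objects
  static const int CAP = 8;           // the real structure would grow on demand
  Obj* elems[CAP];
  int top = 0;
  void push(Obj* o) { elems[top++] = o; }
  void pop()        { --top; }
  bool contains(Obj* o) const {       // "does the current thread own o?"
    for (int i = 0; i < top; i++) if (elems[i] == o) return true;
    return false;
  }
};

static thread_local LockStack lock_stack;

const uintptr_t LOCK_MASK = 0x3;      // low two header bits
const uintptr_t UNLOCKED  = 0x1;      // 01 = unlocked
const uintptr_t LOCKED    = 0x0;      // 00 = fast-locked

// Returns false when the slow path (monitor inflation) would be taken.
bool fast_lock(Obj* o) {
  uintptr_t h = o->header.load(std::memory_order_relaxed);
  if ((h & LOCK_MASK) != UNLOCKED) return false;
  uintptr_t locked = (h & ~LOCK_MASK) | LOCKED;
  if (!o->header.compare_exchange_strong(h, locked)) return false;
  lock_stack.push(o);                 // ownership is recorded thread-locally
  return true;
}

bool fast_unlock(Obj* o) {
  if (!lock_stack.contains(o)) return false;   // inflated, or not owned by us
  uintptr_t h = o->header.load(std::memory_order_relaxed);
  uintptr_t unlocked = (h & ~LOCK_MASK) | UNLOCKED;
  if (!o->header.compare_exchange_strong(h, unlocked)) return false;  // contended meanwhile
  lock_stack.pop();                   // assumes balanced lock/unlock order
  return true;
}
```

The point to notice is `contains()`: the "does the current thread own this object?" question is answered purely from thread-local state, without decoding a pointer into a foreign stack out of the object header.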
#### DaCapo/AArch64 Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? benchmark | baseline | fast-locking | % | size -- | -- | -- | -- | -- avrora | 27859 | 27563 | 1.07% | large batik | 20786 | 20847 | -0.29% | large biojava | 27421 | 27334 | 0.32% | default eclipse | 59918 | 60522 | -1.00% | large fop | 3670 | 3678 | -0.22% | default graphchi | 2088 | 2060 | 1.36% | default h2 | 297391 | 291292 | 2.09% | huge jme | 8762 | 8877 | -1.30% | default jython | 18938 | 18878 | 0.32% | default luindex | 1339 | 1325 | 1.06% | default lusearch | 918 | 936 | -1.92% | default pmd | 58291 | 58423 | -0.23% | large sunflow | 32617 | 24961 | 30.67% | large tomcat | 25481 | 25992 | -1.97% | large tradebeans | 314640 | 311706 | 0.94% | huge tradesoap | 107473 | 110246 | -2.52% | huge xalan | 6047 | 5882 | 2.81% | default zxing | 970 | 926 | 4.75% | default #### DaCapo/x86_64 The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. benchmark | baseline | fast-Locking | % | size -- | -- | -- | -- | -- avrora | 127690 | 126749 | 0.74% | large batik | 12736 | 12641 | 0.75% | large biojava | 15423 | 15404 | 0.12% | default eclipse | 41174 | 41498 | -0.78% | large fop | 2184 | 2172 | 0.55% | default graphchi | 1579 | 1560 | 1.22% | default h2 | 227614 | 230040 | -1.05% | huge jme | 8591 | 8398 | 2.30% | default jython | 13473 | 13356 | 0.88% | default luindex | 824 | 813 | 1.35% | default lusearch | 962 | 968 | -0.62% | default pmd | 40827 | 39654 | 2.96% | large sunflow | 53362 | 43475 | 22.74% | large tomcat | 27549 | 28029 | -1.71% | large tradebeans | 190757 | 190994 | -0.12% | huge tradesoap | 68099 | 67934 | 0.24% | huge xalan | 7969 | 8178 | -2.56% | default zxing | 1176 | 1148 | 2.44% | default #### Renaissance/AArch64 This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 2558.832 | 2513.594 | 1.80% Reactors | 14715.626 | 14311.246 | 2.83% Als | 1851.485 | 1869.622 | -0.97% ChiSquare | 1007.788 | 1003.165 | 0.46% GaussMix | 1157.491 | 1149.969 | 0.65% LogRegression | 717.772 | 733.576 | -2.15% MovieLens | 7916.181 | 8002.226 | -1.08% NaiveBayes | 395.296 | 386.611 | 2.25% PageRank | 4294.939 | 4346.333 | -1.18% FjKmeans | 519.2 | 498.357 | 4.18% FutureGenetic | 2578.504 | 2589.255 | -0.42% Mnemonics | 4898.886 | 4903.689 | -0.10% ParMnemonics | 4260.507 | 4210.121 | 1.20% Scrabble | 139.37 | 138.312 | 0.76% RxScrabble | 320.114 | 322.651 | -0.79% Dotty | 1056.543 | 1068.492 | -1.12% ScalaDoku | 3443.117 | 3449.477 | -0.18% Philosophers | 24333.311 | 23438.22 | 3.82% ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% FinagleChirper | 6814.192 | 6853.38 | -0.57% FinagleHttp | 4762.902 | 4807.564 | -0.93% #### Renaissance/x86_64 benchmark | baseline | fast-locking | % -- | -- | -- | -- AkkaUct | 1117.185 | 1116.425 | 0.07% Reactors | 11561.354 | 11812.499 | -2.13% Als | 1580.838 | 1575.318 | 0.35% ChiSquare | 459.601 | 467.109 | -1.61% GaussMix | 705.944 | 685.595 | 2.97% LogRegression | 659.944 | 656.428 | 0.54% MovieLens | 7434.303 | 7592.271 | -2.08% NaiveBayes | 413.482 | 417.369 | -0.93% PageRank | 3259.233 | 3276.589 | -0.53% FjKmeans | 946.429 | 938.991 | 0.79% FutureGenetic | 1760.672 | 1815.272 | -3.01% Scrabble | 147.996 | 150.084 | -1.39% RxScrabble | 177.755 | 177.956 | -0.11% Dotty | 673.754 | 683.919 | -1.49% ScalaDoku | 2193.562 | 1958.419 | 12.01% ScalaKmeans | 165.376 | 168.925 | -2.10% ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. ### Testing - [x] tier1 (x86_64, aarch64, x86_32) - [x] tier2 (x86_64, aarch64) - [x] tier3 (x86_64, aarch64) - [x] tier4 (x86_64, aarch64) ------------- Commit messages: - Merge tag 'jdk-20+17' into fast-locking - Fix OSR packing in AArch64, part 2 - Fix OSR packing in AArch64 - Merge remote-tracking branch 'upstream/master' into fast-locking - Fix register in interpreter unlock x86_32 - Support unstructured locking in interpreter (x86 parts) - Support unstructured locking in interpreter (aarch64 and shared parts) - Merge branch 'master' into fast-locking - Merge branch 'master' into fast-locking - Added test for hand-over-hand locking - ... 
and 17 more: https://git.openjdk.org/jdk/compare/79ccc791...3ed51053 Changes: https://git.openjdk.org/jdk/pull/10590/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8291555 Stats: 3660 lines in 127 files changed: 650 ins; 2481 del; 529 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From duke at openjdk.org Thu Oct 6 13:08:32 2022 From: duke at openjdk.org (JervenBolleman) Date: Thu, 6 Oct 2022 13:08:32 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> References: <_l7L4QKD3xsKDBmYODw-ZByLKdKlymyNNMZU49ABkBg=.ce136e4e-24ba-434b-ba53-4f53a44ef915@github.com> Message-ID: On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. 
When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 519.2 | 498.357 | 4.18% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) For those following along the new PR is https://github.com/openjdk/jdk/pull/10590 ------------- PR: https://git.openjdk.org/jdk/pull/9680 From jsjolen at openjdk.org Fri Oct 7 11:35:09 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 7 Oct 2022 11:35:09 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream Message-ID: Hi, I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. 
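For context, the pattern being cleaned up looks roughly like this - a sketch only, with a placeholder log tag and placeholder variables (`deflated`, `method`), not lines taken from the patch:

```
// (Inside HotSpot; needs logging/logStream.hpp and memory/resourceArea.hpp.)
// Since LogStream no longer resource-allocates its buffer, a plain log site
// does not need a ResourceMark of its own:
LogTarget(Info, monitorinflation) lt;     // placeholder tag
if (lt.is_enabled()) {
  LogStream ls(lt);
  ls.print_cr("deflated %d monitors", deflated);
}

// ...but a ResourceMark is still needed when the logged arguments themselves
// resource-allocate, e.g. Method::name_and_sig_as_C_string():
if (lt.is_enabled()) {
  ResourceMark rm;                        // covers the C string below
  LogStream ls(lt);
  ls.print_cr("not inlining %s", method->name_and_sig_as_C_string());
}
```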
------------- Commit messages: - Remove unnecessary ResourceMarks Changes: https://git.openjdk.org/jdk/pull/10602/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10602&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8294954 Stats: 59 lines in 41 files changed: 2 ins; 57 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10602.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10602/head:pull/10602 PR: https://git.openjdk.org/jdk/pull/10602 From dholmes at openjdk.org Fri Oct 7 13:21:21 2022 From: dholmes at openjdk.org (David Holmes) Date: Fri, 7 Oct 2022 13:21:21 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sjölen wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. How are you defining "unnecessary"? Are these unnecessary because there is zero resource allocation involved? Or "unnecessary" because a ResourceMark higher up the call stack covers it? ------------- PR: https://git.openjdk.org/jdk/pull/10602 From dholmes at openjdk.org Fri Oct 7 13:32:11 2022 From: dholmes at openjdk.org (David Holmes) Date: Fri, 7 Oct 2022 13:32:11 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sjölen wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. I see now the bug report suggests these RM were in place because the stream itself may have needed them but that this is no longer the case. So was that the only reason for all these RMs? ------------- PR: https://git.openjdk.org/jdk/pull/10602 From jsjolen at openjdk.org Fri Oct 7 13:41:08 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 7 Oct 2022 13:41:08 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 13:28:58 GMT, David Holmes wrote: > I see now the bug report suggests these RM were in place because the stream itself may have needed them but that this is no longer the case. So was that the only reason for all these RMs? There are RMs that I've looked at but left intact because they did have other reasons for being there (typically: string allocating functions). So yes, `LogStream` should be the only reason for all these RMs. ------------- PR: https://git.openjdk.org/jdk/pull/10602 From jsjolen at openjdk.org Fri Oct 7 13:51:15 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 7 Oct 2022 13:51:15 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: <3RXwTxz1C1mjzFvf-yKczgP4lCERhQQsJdCej7iXrFE=.38a314e4-70b5-4356-8360-1fbbbf68230b@github.com> On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sjölen wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks.
I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. This PR does remove the RM in `VM_Operation::evaluate`, and I haven't checked all of the VM operations to see if anyone uses it. ------------- PR: https://git.openjdk.org/jdk/pull/10602 From duke at openjdk.org Sun Oct 9 06:45:10 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Sun, 9 Oct 2022 06:45:10 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check [v2] In-Reply-To: References: Message-ID: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 Tongbao Zhang has updated the pull request incrementally with one additional commit since the last revision: update rank of _alloc_failure_waiters_lock ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10264/files - new: https://git.openjdk.org/jdk/pull/10264/files/23f44fbd..87675608 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10264&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10264&range=00-01 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10264.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10264/head:pull/10264 PR: https://git.openjdk.org/jdk/pull/10264 From duke at openjdk.org Sun Oct 9 06:45:11 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Sun, 9 Oct 2022 06:45:11 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: <7erfXFkhlNdrcP0Pfuw_BzaY0T7g1GqD5dIBDoAMfTE=.2798b236-d61a-484d-a8dc-d2b8f311cb0c@github.com> On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 > Thanks for reminding! 
updated the rank of `_alloc_failure_waiters_lock ` ------------- PR: https://git.openjdk.org/jdk/pull/10264 From shade at openjdk.org Mon Oct 10 12:56:51 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 10 Oct 2022 12:56:51 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check [v2] In-Reply-To: References: Message-ID: On Sun, 9 Oct 2022 06:45:10 GMT, Tongbao Zhang wrote: >> After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock >> >> Symptom >> >> # >> # A fatal error has been detected by the Java Runtime Environment: >> # >> # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 >> # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock >> # >> # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) >> # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) >> # Problematic frame: >> # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 > > Tongbao Zhang has updated the pull request incrementally with one additional commit since the last revision: > > update rank of _alloc_failure_waiters_lock This looks good, thank you! (I tested `hotspot:tier1` with Shenandoah, and it now passes cleanly) ------------- Marked as reviewed by shade (Reviewer). PR: https://git.openjdk.org/jdk/pull/10264 From duke at openjdk.org Tue Oct 11 09:57:36 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Tue, 11 Oct 2022 09:57:36 GMT Subject: RFR: 8293782: Shenandoah: some tests failed on lock rank check [v2] In-Reply-To: References: Message-ID: On Mon, 10 Oct 2022 12:52:55 GMT, Aleksey Shipilev wrote: >> Tongbao Zhang has updated the pull request incrementally with one additional commit since the last revision: >> >> update rank of _alloc_failure_waiters_lock > > This looks good, thank you! (I tested `hotspot:tier1` with Shenandoah, and it now passes cleanly) Thanks for reviews! 
@shipilev @TheRealMDoerr ------------- PR: https://git.openjdk.org/jdk/pull/10264 From duke at openjdk.org Tue Oct 11 10:07:57 2022 From: duke at openjdk.org (Tongbao Zhang) Date: Tue, 11 Oct 2022 10:07:57 GMT Subject: Integrated: 8293782: Shenandoah: some tests failed on lock rank check In-Reply-To: References: Message-ID: On Wed, 14 Sep 2022 07:01:52 GMT, Tongbao Zhang wrote: > After [JDK-8290025](https://bugs.openjdk.org/browse/JDK-8290025), some tests using ShenandoahGC failed on the lock rank check between AdapterHandlerLibrary_lock and ShenandoahRequestedGC_lock > > Symptom > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # Internal Error (/data1/ws/jdk/src/hotspot/share/runtime/mutex.cpp:454), pid=2018566, tid=2022220 > # assert(false) failed: Attempting to acquire lock ShenandoahRequestedGC_lock/safepoint-1 out of order with lock AdapterHandlerLibrary_lock/safepoint-1 -- possible deadlock > # > # JRE version: OpenJDK Runtime Environment (20.0) (slowdebug build 20-internal-adhoc.root.jdk) > # Java VM: OpenJDK 64-Bit Server VM (slowdebug 20-internal-adhoc.root.jdk, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, shenandoah gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0x106fd6a] Mutex::check_rank(Thread*)+0x426 This pull request has now been integrated. Changeset: 6053bf0f Author: Tongbao Zhang Committer: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/6053bf0f6a754bf3943ba6169316513055a5a3b2 Stats: 2 lines in 1 file changed: 0 ins; 0 del; 2 mod 8293782: Shenandoah: some tests failed on lock rank check Reviewed-by: mdoerr, shade ------------- PR: https://git.openjdk.org/jdk/pull/10264 From rkennke at openjdk.org Tue Oct 11 12:16:23 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 12:16:23 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. Hi Nick, Thank you, that is a useful change! I verified performance and it does improve both throughput and latency on several machines (not as much as for you - but I also have not thrown so many CPUs at it.. ) I do have a few suggestions. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 40: > 38: > 39: volatile jint *ShenandoahEvacOOMHandler::threads_in_evac_ptr(Thread* t) { > 40: uint64_t key = (uintptr_t)t; Maybe put that in a separate hash(Thread*) function? Also, is that a particular documented hash-function?(Related: In Lilliput project, I am working on a different identity-hash-code implementation, and part of it will be a hash-implementation to hash arbitrary pointers to 32 or 64 bit hash, currently using murmur3. Maybe this could be reused for here, when it happens?) src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 55: > 53: // *and* the counter is zero. 
> 54: while (Atomic::load_acquire(ptr) != OOM_MARKER_MASK) { > 55: os::naked_short_sleep(1); Not sure if SpinPause() may be better here? @shipilev probably knows more. src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 88: > 86: static const jint OOM_MARKER_MASK; > 87: > 88: static constexpr jint EVAC_COUNTER_BUCKETS = 64; Maybe it'd be useful to not hardwire this? It could be a runtime option, possibly diagnostic (not sure). Many workloads would not even use so many threads... src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 92: > 90: shenandoah_padding(0); > 91: struct { > 92: volatile jint bits; The bits field needs a comment saying that it combines a counter with an OOM bit. In-fact, it would probably benefit from a little bit of refactoring, make it a class, and move accessors and relevant methods into it, and avoid public access to the field? ------------- Changes requested by rkennke (Reviewer). PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Tue Oct 11 12:36:28 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 11 Oct 2022 12:36:28 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: On Wed, 5 Oct 2022 11:10:29 GMT, Nick Gasson wrote: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > Thank you, that is a useful change! I verified performance and it does improve both throughput and latency on several machines (not as much as for you - but I also have not thrown so many CPUs at it.. ) Thanks for testing! The improvement is quite dependent on the machine you're using (the 160-core one is probably an outlier ;-), and there's a marked difference between NUMA and non-NUMA systems. ------------- PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Tue Oct 11 12:36:31 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Tue, 11 Oct 2022 12:36:31 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac In-Reply-To: References: Message-ID: <7mpUhXJtmnGLJ1qqMtbAYNnGPIdTaYVjQjEIAhecNds=.7d5d02c4-6506-440f-969e-4a26e5f057ca@github.com> On Tue, 11 Oct 2022 12:02:43 GMT, Roman Kennke wrote: >> The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. >> >> See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html >> >> Also tested `hotspot_gc_shenandoah` on x86 and AArch64. > > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 40: > >> 38: >> 39: volatile jint *ShenandoahEvacOOMHandler::threads_in_evac_ptr(Thread* t) { >> 40: uint64_t key = (uintptr_t)t; > > Maybe put that in a separate hash(Thread*) function? 
Also, is that a particular documented hash-function?(Related: In Lilliput project, I am working on a different identity-hash-code implementation, and part of it will be a hash-implementation to hash arbitrary pointers to 32 or 64 bit hash, currently using murmur3. Maybe this could be reused for here, when it happens?) It is actually the bit mixing function from MurmurHash3. The particular algorithm doesn't matter too much though - I just couldn't find an existing one in the shared code. > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.cpp line 55: > >> 53: // *and* the counter is zero. >> 54: while (Atomic::load_acquire(ptr) != OOM_MARKER_MASK) { >> 55: os::naked_short_sleep(1); > > Not sure if SpinPause() may be better here? @shipilev probably knows more. I think we'd probably want some back-off here rather than spinning indefinitely? E.g. spin N times and then start sleeping. > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 88: > >> 86: static const jint OOM_MARKER_MASK; >> 87: >> 88: static constexpr jint EVAC_COUNTER_BUCKETS = 64; > > Maybe it'd be useful to not hardwire this? It could be a runtime option, possibly diagnostic (not sure). Many workloads would not even use so many threads... If we're going to make it dynamic maybe it should be set to the number of physical CPUs? ------------- PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Tue Oct 11 18:14:17 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Tue, 11 Oct 2022 18:14:17 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: References: Message-ID: On Thu, 6 Oct 2022 10:23:04 GMT, Roman Kennke wrote: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. 
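A minimal, self-contained C++ sketch of the lock-stack idea described in the paragraph above: CAS the two low header bits from 01 to 00 and record ownership by pushing the object onto a small per-thread array. All names here (`Object`, `LockStack`, `fast_lock`, `CAPACITY`) are illustrative stand-ins, not the actual HotSpot `markWord`/lock-stack code, and inflation to a full monitor on contention is only hinted at.

```
#include <atomic>
#include <cassert>
#include <cstdint>

struct Object {
  // Model of the mark word: low two bits 01 = unlocked, 00 = fast-locked.
  std::atomic<uintptr_t> mark{0b01};
};

struct LockStack {
  static const int CAPACITY = 8;      // experience: typically only 3-5 entries in use
  Object* elems[CAPACITY];
  int top = 0;

  void push(Object* o) { assert(top < CAPACITY); elems[top++] = o; }
  void pop(Object* o)  { assert(top > 0 && elems[top - 1] == o); top--; }

  // "does the current thread own me?" is a quick scan of this small array
  bool contains(Object* o) const {
    for (int i = 0; i < top; i++) {
      if (elems[i] == o) return true;
    }
    return false;
  }
};

thread_local LockStack lock_stack;    // one lock stack per thread

bool fast_lock(Object* o) {
  uintptr_t unlocked = o->mark.load() | 0b01;        // expect an unlocked header
  uintptr_t locked   = unlocked & ~uintptr_t(0b11);  // low bits 00 = fast-locked
  if (o->mark.compare_exchange_strong(unlocked, locked)) {
    lock_stack.push(o);   // ownership is recorded here, not in the header
    return true;
  }
  return false;           // already locked or contended: would inflate to a monitor
}

void fast_unlock(Object* o) {
  lock_stack.pop(o);
  // Real code must CAS and cope with concurrent inflation; here we simply
  // restore the unlocked bit pattern.
  o->mark.fetch_or(0b01);
}

int main() {
  Object o;
  bool ok = fast_lock(&o);
  assert(ok && lock_stack.contains(&o));
  fast_unlock(&o);
  assert(!lock_stack.contains(&o));
  return 0;
}
```

The point of the sketch is only that ownership lives in the per-thread array rather than in the header bits, so the common "does the current thread own me?" question is the small `contains()` scan, while the header itself carries no pointer back into the stack.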
> > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) I have a few questions after porting this to RISC-V... src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 272: > 270: // SharedRuntime::OSR_migration_begin() packs BasicObjectLocks in > 271: // the OSR buffer using 2 word entries: first the lock and then > 272: // the oop. This comment is now irrelevant? src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 432: > 430: if (method()->is_synchronized()) { > 431: monitor_address(0, FrameMap::r0_opr); > 432: __ ldr(r4, Address(r0, BasicObjectLock::obj_offset_in_bytes())); Do we have to use a new register here, or can we just reuse `r0`? 
src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1886: > 1884: > 1885: __ mov(c_rarg0, obj_reg); > 1886: __ mov(c_rarg1, rthread); Now that you dropped an argument here, you need to do `__ call_VM_leaf` with `2`, not with `3` arguments? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Tue Oct 11 19:49:32 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 19:49:32 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v2] In-Reply-To: References: Message-ID: <4G3892Q41Qwlt15Y1dmLWkNUmyIEusWVJH2fdb3K0eM=.5ff1859b-baa1-4d60-866b-8e9747a79180@github.com> > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. 
However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. 
> > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). 
> > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: Fix number of rt args to complete_monitor_locking_C, remove some comments ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/3ed51053..34bed54f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=00-01 Stats: 13 lines in 6 files changed: 0 ins; 11 del; 2 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Tue Oct 11 20:01:32 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 20:01:32 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. 
Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. 
> > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Re-use r0 in call to unlock_object() ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/34bed54f..4ccdab8f Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=01-02 Stats: 7 lines in 3 files changed: 0 ins; 1 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Tue Oct 11 20:01:33 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Tue, 11 Oct 2022 20:01:33 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 13:25:30 GMT, Aleksey Shipilev wrote: >> Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: >> >> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking >> - Re-use r0 in call to unlock_object() > > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 272: > >> 270: // SharedRuntime::OSR_migration_begin() packs BasicObjectLocks in >> 271: // the OSR buffer using 2 word entries: first the lock and then >> 272: // the oop. > > This comment is now irrelevant? Yes, removed it there and in same files in other arches. > src/hotspot/cpu/aarch64/c1_LIRAssembler_aarch64.cpp line 432: > >> 430: if (method()->is_synchronized()) { >> 431: monitor_address(0, FrameMap::r0_opr); >> 432: __ ldr(r4, Address(r0, BasicObjectLock::obj_offset_in_bytes())); > > Do we have to use a new register here, or can we just reuse `r0`? r0 is used below in call to unlock_object(), but not actually used there. I shuffled it a little and re-use r0 now. > src/hotspot/cpu/aarch64/sharedRuntime_aarch64.cpp line 1886: > >> 1884: >> 1885: __ mov(c_rarg0, obj_reg); >> 1886: __ mov(c_rarg1, rthread); > > Now that you dropped an argument here, you need to do `__ call_VM_leaf` with `2`, not with `3` arguments? Good catch! Yes. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Tue Oct 11 20:44:06 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Tue, 11 Oct 2022 20:44:06 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 20:01:32 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). 
The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. 
The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? >> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
>> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) > > Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: > > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Re-use r0 in call to unlock_object() Regarding benchmarks, is it possible to get some indication what fast-locking+lillput result will be? FinagleHttp seems to suffer a bit, will Lillput give some/all of that back, or more? 
------------- PR: https://git.openjdk.org/jdk/pull/10590 From shade at openjdk.org Wed Oct 12 11:30:07 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 12 Oct 2022 11:30:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 20:01:32 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. 
The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? >> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. 
>> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. 
They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) > > Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: > > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Re-use r0 in call to unlock_object() Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 07:33:48 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 07:33:48 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. 
However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. 
> > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: RISC-V port ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/4ccdab8f..d9153be5 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=02-03 Stats: 368 lines in 11 files changed: 89 ins; 211 del; 68 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Thu Oct 13 08:50:27 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Thu, 13 Oct 2022 08:50:27 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 07:33:48 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. 
What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
>> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
>> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) > > Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: > > RISC-V port On aarch64 (linux and mac) I see these variations of crashes in random tests: # Internal Error .... 
src/hotspot/share/c1/c1_Runtime1.cpp:768), pid=2884803, tid=2884996 # assert(oopDesc::is_oop(oop(obj))) failed: must be NULL or an object: 0x000000000000dead # V [libjvm.so+0x7851d4] Runtime1::monitorexit(JavaThread*, oopDesc*)+0x110 # SIGSEGV (0xb) at pc=0x0000fffc9d4e3de8, pid=1842880, tid=1842994 # V [libjvm.so+0xbf3de8] SharedRuntime::monitor_exit_helper(oopDesc*, JavaThread*)+0x24 # SIGSEGV (0xb) at pc=0x0000fffca9f00394, pid=959883, tid=959927 # V [libjvm.so+0xc90394] ObjectSynchronizer::exit(oopDesc*, JavaThread*)+0x54 ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 10:35:16 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 10:35:16 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v5] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. 
When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with two additional commits since the last revision: - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Revert "Re-use r0 in call to unlock_object()" This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. 
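To make the lock-stack scheme described in the PR body above easier to follow, here is a small standalone C++ model of the idea: a per-thread array of object pointers plus a CAS on the two low header bits. This is only an illustrative sketch under simplifying assumptions (the types, names and header-bit encoding are invented for the example), not the code in this PR; recursion, inflation to a full ObjectMonitor and the ANONYMOUS_OWNER hand-off are deliberately left out.

```c++
// Standalone model of fast-locking (illustration only, not the HotSpot code in this PR).
// Low two bits of the header word: 01 = unlocked, 00 = fast-locked (encoding assumed for the example).
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct Object {
  std::atomic<uintptr_t> header{0x1};   // starts unlocked (low bits 01)
};

constexpr uintptr_t kLockBits   = 0x3;
constexpr uintptr_t kUnlocked   = 0x1;
constexpr uintptr_t kFastLocked = 0x0;

// Per-thread lock stack: the objects this thread has fast-locked, in order.
// In practice this stays very small (3-5 entries).
thread_local std::vector<Object*> lock_stack;

// "Does the current thread own this lock?" is a quick scan of the small array.
bool current_thread_owns(Object* obj) {
  for (Object* o : lock_stack) {
    if (o == obj) return true;
  }
  return false;
}

// Fast-lock: CAS the low bits from 01 (unlocked) to 00 (fast-locked), then push
// the object onto the thread-local lock stack. On contention the real
// implementation inflates to a full ObjectMonitor instead of just failing.
bool try_fast_lock(Object* obj) {
  uintptr_t h = obj->header.load(std::memory_order_relaxed);
  if ((h & kLockBits) != kUnlocked) {
    return false;                        // already locked or inflated
  }
  uintptr_t locked = (h & ~kLockBits) | kFastLocked;
  if (obj->header.compare_exchange_strong(h, locked)) {
    lock_stack.push_back(obj);
    return true;
  }
  return false;
}

// Fast-unlock: pop the lock stack and restore the unlocked bits. The real
// implementation must also cope with the lock having been inflated by a
// contending thread while it was held (the ANONYMOUS_OWNER case).
void fast_unlock(Object* obj) {
  assert(!lock_stack.empty() && lock_stack.back() == obj);
  lock_stack.pop_back();
  uintptr_t h = obj->header.load(std::memory_order_relaxed);
  obj->header.store((h & ~kLockBits) | kUnlocked, std::memory_order_release);
}

int main() {
  Object o;
  assert(try_fast_lock(&o));
  assert(current_thread_owns(&o));
  fast_unlock(&o);
  assert(!current_thread_owns(&o));
}
```

The point of the structure is that the common question "does the current thread own this lock?" only touches thread-local data, while the object header carries nothing but the two lock bits, which is what frees the mark word for Lilliput.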
------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/d9153be5..8d146b99 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=04 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=03-04 Stats: 7 lines in 3 files changed: 1 ins; 0 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 10:36:34 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 10:36:34 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 08:46:45 GMT, Robbin Ehn wrote: > On aarch64 (linux and mac) I see these variations of crashes in random tests: (asserts in debug, crash in release it looks like) > > ``` > # Internal Error .... src/hotspot/share/c1/c1_Runtime1.cpp:768), pid=2884803, tid=2884996 > # assert(oopDesc::is_oop(oop(obj))) failed: must be NULL or an object: 0x000000000000dead > # V [libjvm.so+0x7851d4] Runtime1::monitorexit(JavaThread*, oopDesc*)+0x110 > ``` > > ``` > # SIGSEGV (0xb) at pc=0x0000fffc9d4e3de8, pid=1842880, tid=1842994 > # V [libjvm.so+0xbf3de8] SharedRuntime::monitor_exit_helper(oopDesc*, JavaThread*)+0x24 > ``` > > ``` > # SIGSEGV (0xb) at pc=0x0000fffca9f00394, pid=959883, tid=959927 > # V [libjvm.so+0xc90394] ObjectSynchronizer::exit(oopDesc*, JavaThread*)+0x54 > ``` Ugh. That is most likely caused by the recent change: https://github.com/rkennke/jdk/commit/ebbcb615a788998596f403b47b72cf133cb9de46 It used to be very stable before that. I have backed out that change, can you try again? Thanks, Roman ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Thu Oct 13 10:42:03 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Thu, 13 Oct 2022 10:42:03 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: On Tue, 11 Oct 2022 20:41:57 GMT, Robbin Ehn wrote: > Regarding benchmarks, is it possible to get some indication what fast-locking+lillput result will be? FinagleHttp seems to suffer a bit, will Lillput give some/all of that back, or more? That particular benchmark, as some others, exhibit relatively high run-to-run variance. I have run it again many more times to average-out the variance, and I'm now getting the following results: baseline: 3503.844 ms/ops, fast-locking: 3546.344 ms/ops, percent: -1.20% That is still a slight regression, but with more confidence. Regarding Lilliput, I cannot really say at the moment. Some workloads are actually regressing with Lilliput, presumably because they are sensitive on the performance of loading the Klass* out of objects, and that is currently more complex in Lilliput (because it needs to coordinate with monitor locking). FinagleHttp seems to be one of those workloads. I am working to get rid of this limitation, and then I can be more specific. 
------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 01:22:58 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 01:22:58 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: On Wed, 12 Oct 2022 11:26:16 GMT, Aleksey Shipilev wrote: > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Fri Oct 14 06:45:08 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 14 Oct 2022 06:45:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v4] In-Reply-To: References: Message-ID: On Thu, 13 Oct 2022 10:34:04 GMT, Roman Kennke wrote: > It used to be very stable before that. I have backed out that change, can you try again? Seems fine now, thanks. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 13:47:13 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 13:47:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: Message-ID: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> On Fri, 14 Oct 2022 01:19:27 GMT, Fei Yang wrote: > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 14 14:30:00 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 14 Oct 2022 14:30:00 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> References: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> Message-ID: On Fri, 14 Oct 2022 13:45:07 GMT, Fei Yang wrote: > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. 
------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 14:35:07 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 14:35:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> Message-ID: <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> On Fri, 14 Oct 2022 14:26:20 GMT, Roman Kennke wrote: > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 14 14:41:07 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 14 Oct 2022 14:41:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> References: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> Message-ID: On Fri, 14 Oct 2022 14:32:57 GMT, Fei Yang wrote: > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > > > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. 
------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Fri Oct 14 14:56:11 2022 From: fyang at openjdk.org (Fei Yang) Date: Fri, 14 Oct 2022 14:56:11 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> Message-ID: On Fri, 14 Oct 2022 14:39:01 GMT, Roman Kennke wrote: > > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > > > > > > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. > > > > > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff > > The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. @rkennke : I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff But I didn't check whether that will cause a problem here. patching file src/hotspot/cpu/riscv/c1_CodeStubs_riscv.cpp patching file src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp patching file src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp patching file src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.cpp Hunk #1 succeeded at 58 (offset -1 lines). Hunk #2 succeeded at 67 (offset -1 lines). patching file src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp patching file src/hotspot/cpu/riscv/interp_masm_riscv.cpp patching file src/hotspot/cpu/riscv/macroAssembler_riscv.cpp Hunk #1 succeeded at 2499 (offset 324 lines). Hunk #2 succeeded at 4474 (offset 330 lines). patching file src/hotspot/cpu/riscv/macroAssembler_riscv.hpp Hunk #1 succeeded at 869 with fuzz 2 (offset 313 lines). Hunk #2 succeeded at 1252 (offset 325 lines). patching file src/hotspot/cpu/riscv/riscv.ad Hunk #1 succeeded at 2385 (offset 7 lines). Hunk #2 succeeded at 2407 (offset 7 lines). Hunk #3 succeeded at 2433 (offset 7 lines). Hunk #4 succeeded at 10403 (offset 33 lines). Hunk #5 succeeded at 10417 (offset 33 lines). patching file src/hotspot/cpu/riscv/sharedRuntime_riscv.cpp Hunk #1 succeeded at 975 (offset 21 lines). Hunk #2 succeeded at 1030 (offset 21 lines). Hunk #3 succeeded at 1042 (offset 21 lines). Hunk #4 succeeded at 1058 (offset 21 lines). Hunk #5 succeeded at 1316 (offset 24 lines). Hunk #6 succeeded at 1416 (offset 24 lines). Hunk #7 succeeded at 1492 (offset 24 lines). Hunk #8 succeeded at 1517 (offset 24 lines). Hunk #9 succeeded at 1621 (offset 24 lines). 
------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 14 15:42:01 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 14 Oct 2022 15:42:01 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: References: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> Message-ID: <2abWu-ITUoN-hNBTy6f0qQN-Q5XuAF3XXbTe7Kz63iU=.350a2155-f2ef-4909-98d8-350306413f74@github.com> On Fri, 14 Oct 2022 14:53:57 GMT, Fei Yang wrote: > > > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch > > > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? > > > > > > > > > > > > > > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) > > > > > > > > > > > > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? > > > > > > > > > > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. > > > > > > > > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff > > > > > > The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. > > @rkennke : I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff But I didn't check whether that will cause a problem here. If you take the latest code from this PR, it would already have the patch applied. No need to patch it again. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From fyang at openjdk.org Mon Oct 17 04:33:08 2022 From: fyang at openjdk.org (Fei Yang) Date: Mon, 17 Oct 2022 04:33:08 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v3] In-Reply-To: <2abWu-ITUoN-hNBTy6f0qQN-Q5XuAF3XXbTe7Kz63iU=.350a2155-f2ef-4909-98d8-350306413f74@github.com> References: <05W6k3vqT1b5IGhd653G8zPjCbtiN7HFg8KzZsiMorQ=.38f418d5-540e-46af-a72c-9d6b4471428a@github.com> <9KWs3-ICjuSPKWkcn-hTz0V2rMUrn8B6aqmE2spm5es=.cc94175e-a8f9-468a-991a-656ee2c8c581@github.com> <2abWu-ITUoN-hNBTy6f0qQN-Q5XuAF3XXbTe7Kz63iU=.350a2155-f2ef-4909-98d8-350306413f74@github.com> Message-ID: On Fri, 14 Oct 2022 15:39:41 GMT, Roman Kennke wrote: >>> > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch >>> > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? >>> > > > > >>> > > > > >>> > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) >>> > > > >>> > > > >>> > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? >>> > > >>> > > >>> > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. 
>>> > >>> > >>> > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff >>> >>> The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. >> >> @rkennke : >> I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff >> But I didn't check whether that will cause a problem here. >> >> >> patching file src/hotspot/cpu/riscv/c1_CodeStubs_riscv.cpp >> patching file src/hotspot/cpu/riscv/c1_LIRAssembler_riscv.cpp >> patching file src/hotspot/cpu/riscv/c1_LIRGenerator_riscv.cpp >> patching file src/hotspot/cpu/riscv/c1_MacroAssembler_riscv.cpp >> Hunk #1 succeeded at 58 (offset -1 lines). >> Hunk #2 succeeded at 67 (offset -1 lines). >> patching file src/hotspot/cpu/riscv/c1_Runtime1_riscv.cpp >> patching file src/hotspot/cpu/riscv/interp_masm_riscv.cpp >> patching file src/hotspot/cpu/riscv/macroAssembler_riscv.cpp >> Hunk #1 succeeded at 2499 (offset 324 lines). >> Hunk #2 succeeded at 4474 (offset 330 lines). >> patching file src/hotspot/cpu/riscv/macroAssembler_riscv.hpp >> Hunk #1 succeeded at 869 with fuzz 2 (offset 313 lines). >> Hunk #2 succeeded at 1252 (offset 325 lines). >> patching file src/hotspot/cpu/riscv/riscv.ad >> Hunk #1 succeeded at 2385 (offset 7 lines). >> Hunk #2 succeeded at 2407 (offset 7 lines). >> Hunk #3 succeeded at 2433 (offset 7 lines). >> Hunk #4 succeeded at 10403 (offset 33 lines). >> Hunk #5 succeeded at 10417 (offset 33 lines). >> patching file src/hotspot/cpu/riscv/sharedRuntime_riscv.cpp >> Hunk #1 succeeded at 975 (offset 21 lines). >> Hunk #2 succeeded at 1030 (offset 21 lines). >> Hunk #3 succeeded at 1042 (offset 21 lines). >> Hunk #4 succeeded at 1058 (offset 21 lines). >> Hunk #5 succeeded at 1316 (offset 24 lines). >> Hunk #6 succeeded at 1416 (offset 24 lines). >> Hunk #7 succeeded at 1492 (offset 24 lines). >> Hunk #8 succeeded at 1517 (offset 24 lines). >> Hunk #9 succeeded at 1621 (offset 24 lines). > >> > > > > > > Here is the basic support for RISC-V: https://cr.openjdk.java.net/~shade/8291555/riscv-patch-1.patch >> > > > > > > -- I adapted this from AArch64 changes, and tested it very lightly. @RealFYang, can I leave the testing and follow up fixes to you? >> > > > > > >> > > > > > >> > > > > > @shipilev : Sure, I am happy to to that! Thanks for porting this to RISC-V :-) >> > > > > >> > > > > >> > > > > @shipilev : After applying this on today's jdk master, linux-riscv64 fastdebug fail to build on HiFive Unmatched. I see JVM crash happens during the build process. I suppose you carried out the test with some release build, right? >> > > > >> > > > >> > > > Have you applied the whole PR? Or only the patch that @shipilev provided. Because only the patch without the rest of the PR is bound to fail. >> > > >> > > >> > > Yes, the whole PR: https://patch-diff.githubusercontent.com/raw/openjdk/jdk/pull/10590.diff >> > >> > >> > The PR reports a merge conflict in risc-v code, when applied vs latest tip. Have you resolved that? GHA (which includes risc-v) is happy, otherwise. >> >> @rkennke : I did see some "Hunk succeeded" messages for the risc-v part when applying the change with: $ patch -p1 < ~/10590.diff But I didn't check whether that will cause a problem here. > > If you take the latest code from this PR, it would already have the patch applied. No need to patch it again. @rkennke : Could you please add this follow-up fix for RISC-V? 
I can build fastdebug on the HiFive Unmatched board with this fix now and run non-trivial benchmark workloads. I will carry out more tests. [riscv-patch-2.txt](https://github.com/openjdk/jdk/files/9796886/riscv-patch-2.txt) ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Mon Oct 17 10:13:13 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 17 Oct 2022 10:13:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v6] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread. > > As an alternative, I considered removing stack-locking altogether, and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions.
All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. 
> > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). 
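To make the lock-stack mechanism described at the top of this description concrete, the fast path can be pictured roughly as below. This is only an illustrative sketch with made-up names and constants (`LockStack`, `fast_lock`, `fast_unlock`, the bit masks) -- it is not the code in this PR, and it ignores details such as retries, safepoints and the ANONYMOUS_OWNER hand-off.

```c++
#include <atomic>
#include <cstdint>

using oop = void*;                          // stand-in for HotSpot's oop type

static const uintptr_t kLockMask   = 0x3;   // low two bits of the mark word
static const uintptr_t kUnlocked   = 0x1;   // 01: neutral, not locked
static const uintptr_t kFastLocked = 0x0;   // 00: fast-locked, owner recorded on a lock-stack

struct LockStack {                          // small per-thread array of oops
  static const int kCapacity = 8;
  oop _elems[kCapacity];
  int _top = 0;

  bool contains(oop o) const {              // "does the current thread own me?"
    for (int i = 0; i < _top; i++) {
      if (_elems[i] == o) return true;
    }
    return false;
  }
  void push(oop o) { _elems[_top++] = o; }
  void pop(oop o)  { if (_top > 0 && _elems[_top - 1] == o) _top--; }
};

// Fast lock: CAS the low bits from 'unlocked' to 'fast-locked' and record
// ownership on the thread-local lock-stack. Returns false when the caller
// has to fall back to inflating a full monitor (contention, recursion, ...).
bool fast_lock(std::atomic<uintptr_t>& mark, oop obj, LockStack& ls) {
  uintptr_t old_mark = mark.load(std::memory_order_relaxed);
  if ((old_mark & kLockMask) == kUnlocked) {
    uintptr_t new_mark = (old_mark & ~kLockMask) | kFastLocked;
    if (mark.compare_exchange_strong(old_mark, new_mark)) {
      ls.push(obj);
      return true;
    }
  }
  return false;
}

// Fast unlock: restore the 'unlocked' bits and pop the lock-stack. If the CAS
// fails, a contending thread has already inflated the lock and the monitor
// exit path has to run instead.
bool fast_unlock(std::atomic<uintptr_t>& mark, oop obj, LockStack& ls) {
  uintptr_t locked_mark   = (mark.load(std::memory_order_relaxed) & ~kLockMask) | kFastLocked;
  uintptr_t unlocked_mark = (locked_mark & ~kLockMask) | kUnlocked;
  if (mark.compare_exchange_strong(locked_mark, unlocked_mark)) {
    ls.pop(obj);
    return true;
  }
  return false;
}
```

The point of this shape is that ownership lives in the small thread-local array, so the common question 'does the current thread own me?' is a scan over a handful of elements rather than a decode of the header word.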
> > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) Roman Kennke has updated the pull request incrementally with one additional commit since the last revision: More RISC-V fixes ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10590/files - new: https://git.openjdk.org/jdk/pull/10590/files/8d146b99..57403ad1 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=05 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=04-05 Stats: 37 lines in 5 files changed: 0 ins; 8 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From shade at openjdk.org Mon Oct 17 18:27:44 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Mon, 17 Oct 2022 18:27:44 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v2] In-Reply-To: References: Message-ID: > There are number of places where misleading-indentation is reported by GCC. Currently, the warning is disabled for the entirety of Hotspot, which is not good. > > C1 does an unusual style here. Changing it globally would touch a lot of lines. Instead of doing that, I fit the existing style while also resolving the warnings. Note this actually solves a bug in `lir_alloc_array`, where `do_temp` are called without a check. > > Build-tested this with product of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server, zero} > - {release, fastdebug} > > Linux x86_64 fastdebug `tier1` is fine. Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains four commits: - Merge branch 'master' into JDK-8294438-misleading-indentation - Merge branch 'master' into JDK-8294438-misleading-indentation - Also javaClasses.cpp - Fix ------------- Changes: https://git.openjdk.org/jdk/pull/10444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10444&range=01 Stats: 56 lines in 5 files changed: 7 ins; 20 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/10444.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10444/head:pull/10444 PR: https://git.openjdk.org/jdk/pull/10444 From stefank at openjdk.org Tue Oct 18 13:04:57 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Tue, 18 Oct 2022 13:04:57 GMT Subject: RFR: 8295475: Move non-resource allocation strategies out of ResourceObj Message-ID: <4RakidFUe7jYYkY_1XkaBRuwJCxPd90CO1trC7QNzno=.18335453-ebc7-42b3-8973-d2ffefc47b53@github.com> Background to this patch: This prototype/patch has been discussed with a few HotSpot devs, and I've gotten feedback that I should send it out for broader discussion/review. It could be a first step to make it easier to talk about our allocation super classes and strategies. This in turn would make it easier to have further discussions around how to make our allocation strategies more flexible. E.g. do we really need to tie down utility classes to a specific allocation strategy? Do we really have to provide MEMFLAGS as compile time flags? Etc. PR RFC: HotSpot has a few allocation classes that other classes can inherit from to get different dynamic-allocation strategies: MetaspaceObj - allocates in the Metaspace CHeap - uses malloc ResourceObj - ... 
The last class sounds like it provides an allocation strategy to allocate inside a thread's resource area. This is true, but it also provides functions to allow the instances to be allocated in Arenas or even CHeap-allocated memory. This is IMHO misleading, and often leads to confusion among HotSpot developers. I propose that we simplify ResourceObj to only provide an allocation strategy for resource allocations, and move the multi-allocation strategy feature to another class, which isn't named ResourceObj. In my proposal and prototype I've used the name AnyObj, as a short, simple name. I'm open to changing the name to something else. The patch also adds a new class named ArenaObj, which is for objects only allocated in provided arenas. The patch also removes the need to provide ResourceObj/AnyObj::C_HEAP to `operator new`. If you pass in a MEMFLAGS argument it now means that you want to allocate on the CHeap. ------------- Commit messages: - Remove AnyObj new operator taking an allocation_type - Use more specific allocation types Changes: https://git.openjdk.org/jdk/pull/10745/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10745&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295475 Stats: 458 lines in 152 files changed: 67 ins; 37 del; 354 mod Patch: https://git.openjdk.org/jdk/pull/10745.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10745/head:pull/10745 PR: https://git.openjdk.org/jdk/pull/10745 From stefank at openjdk.org Tue Oct 18 13:42:40 2022 From: stefank at openjdk.org (Stefan Karlsson) Date: Tue, 18 Oct 2022 13:42:40 GMT Subject: RFR: 8295475: Move non-resource allocation strategies out of ResourceObj [v2] In-Reply-To: <4RakidFUe7jYYkY_1XkaBRuwJCxPd90CO1trC7QNzno=.18335453-ebc7-42b3-8973-d2ffefc47b53@github.com> References: <4RakidFUe7jYYkY_1XkaBRuwJCxPd90CO1trC7QNzno=.18335453-ebc7-42b3-8973-d2ffefc47b53@github.com> Message-ID: > Background to this patch: > > This prototype/patch has been discussed with a few HotSpot devs, and I've gotten feedback that I should send it out for broader discussion/review. It could be a first step to make it easier to talk about our allocation super classes and strategies. This in turn would make it easier to have further discussions around how to make our allocation strategies more flexible. E.g. do we really need to tie down utility classes to a specific allocation strategy? Do we really have to provide MEMFLAGS as compile time flags? Etc. > > PR RFC: > > HotSpot has a few allocation classes that other classes can inherit from to get different dynamic-allocation strategies: > > MetaspaceObj - allocates in the Metaspace > CHeap - uses malloc > ResourceObj - ... > > The last class sounds like it provides an allocation strategy to allocate inside a thread's resource area. This is true, but it also provides functions to allow the instances to be allocated in Arenas or even CHeap-allocated memory. > > This is IMHO misleading, and often leads to confusion among HotSpot developers. > > I propose that we simplify ResourceObj to only provide an allocation strategy for resource allocations, and move the multi-allocation strategy feature to another class, which isn't named ResourceObj. > > In my proposal and prototype I've used the name AnyObj, as a short, simple name. I'm open to changing the name to something else. > > The patch also adds a new class named ArenaObj, which is for objects only allocated in provided arenas. > > The patch also removes the need to provide ResourceObj/AnyObj::C_HEAP to `operator new`.
If you pass in a MEMFLAGS argument it now means that you want to allocate on the CHeap. Stefan Karlsson has updated the pull request incrementally with one additional commit since the last revision: Fix Shenandoah ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10745/files - new: https://git.openjdk.org/jdk/pull/10745/files/bafa0229..4e8ac797 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10745&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10745&range=00-01 Stats: 4 lines in 4 files changed: 0 ins; 0 del; 4 mod Patch: https://git.openjdk.org/jdk/pull/10745.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10745/head:pull/10745 PR: https://git.openjdk.org/jdk/pull/10745 From ngasson at openjdk.org Wed Oct 19 17:05:26 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 19 Oct 2022 17:05:26 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v2] In-Reply-To: References: Message-ID: > The idea here is to reduce contention on the shared `_threads_in_evac` counter by splitting its state over multiple independent cache lines. Each thread hashes to one particular counter based on its `Thread*`. This helps improve throughput of concurrent evacuation where many Java threads may be attempting to update this counter on the load barrier slow path. > > See this earlier thread for details and SPECjbb results: https://mail.openjdk.org/pipermail/shenandoah-dev/2022-October/017494.html > > Also tested `hotspot_gc_shenandoah` on x86 and AArch64. Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: Refactor ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10573/files - new: https://git.openjdk.org/jdk/pull/10573/files/2303fbed..14cec5ed Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10573&range=00-01 Stats: 184 lines in 3 files changed: 109 ins; 45 del; 30 mod Patch: https://git.openjdk.org/jdk/pull/10573.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10573/head:pull/10573 PR: https://git.openjdk.org/jdk/pull/10573 From ngasson at openjdk.org Wed Oct 19 17:05:26 2022 From: ngasson at openjdk.org (Nick Gasson) Date: Wed, 19 Oct 2022 17:05:26 GMT Subject: RFR: 8294775: Shenandoah: reduce contention on _threads_in_evac [v2] In-Reply-To: References: Message-ID: <8AVo6UxTFFl8JY9i2Oy9XrWm50OUYEhFYOuv1-7mslA=.6995e072-5dca-4366-a7b4-a80245b4ff98@github.com> On Tue, 11 Oct 2022 11:57:55 GMT, Roman Kennke wrote: >> Nick Gasson has updated the pull request incrementally with one additional commit since the last revision: >> >> Refactor > > src/hotspot/share/gc/shenandoah/shenandoahEvacOOMHandler.hpp line 92: > >> 90: shenandoah_padding(0); >> 91: struct { >> 92: volatile jint bits; > > The bits field needs a comment saying that it combines a counter with an OOM bit. In-fact, it would probably benefit from a little bit of refactoring, make it a class, and move accessors and relevant methods into it, and avoid public access to the field? I'm not sure if it's exactly what you intended, but I had a go at refactoring it in the last commit. 
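For readers following the thread, the shape under discussion -- a cache-line-padded counter that packs the number of threads in evacuation together with an OOM flag behind accessors, striped by `Thread*` -- looks roughly like the sketch below. The names (`EvacCounter`, `try_enter`, `kOomBit`, `counter_for_thread`) and the constants are invented for illustration; they are not the identifiers used in the actual commit.

```c++
#include <atomic>
#include <cstdint>

// One counter word packing "threads currently in evacuation" with an OOM flag.
class EvacCounter {
  static constexpr int32_t kOomBit = 0x40000000;  // illustrative value only
  std::atomic<int32_t> _bits{0};                  // low bits: thread count

public:
  // Register the current thread; fails once the OOM flag has been raised.
  bool try_enter() {
    int32_t cur = _bits.load(std::memory_order_acquire);
    while ((cur & kOomBit) == 0) {
      if (_bits.compare_exchange_weak(cur, cur + 1, std::memory_order_acq_rel)) {
        return true;
      }
    }
    return false;
  }
  void leave()     { _bits.fetch_sub(1, std::memory_order_release); }
  void raise_oom() { _bits.fetch_or(kOomBit, std::memory_order_acq_rel); }
  // True once the flag is set and no thread is left inside; the count can only
  // decrease after the flag is raised, so this condition is stable.
  bool drained() const { return _bits.load(std::memory_order_acquire) == kOomBit; }
};

// Stripe the counters over separate cache lines and hash each thread onto one.
struct alignas(64) PaddedEvacCounter { EvacCounter counter; };

static const int kStripes = 64;
static PaddedEvacCounter g_stripes[kStripes];

inline EvacCounter& counter_for_thread(const void* thread) {
  uintptr_t h = reinterpret_cast<uintptr_t>(thread) >> 6;  // drop alignment bits
  return g_stripes[h % kStripes].counter;
}
```

Striping matters because every Java thread hits one of these words on the load-barrier slow path during evacuation, so spreading the updates over independent cache lines keeps a single contended CAS from serializing them.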
------------- PR: https://git.openjdk.org/jdk/pull/10573 From shade at openjdk.org Wed Oct 19 19:11:42 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Wed, 19 Oct 2022 19:11:42 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: Message-ID: > There are number of places where misleading-indentation is reported by GCC. Currently, the warning is disabled for the entirety of Hotspot, which is not good. > > C1 does an unusual style here. Changing it globally would touch a lot of lines. Instead of doing that, I fit the existing style while also resolving the warnings. Note this actually solves a bug in `lir_alloc_array`, where `do_temp` are called without a check. > > Build-tested this with product of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server, zero} > - {release, fastdebug} > > Linux x86_64 fastdebug `tier1` is fine. Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: - Merge branch 'master' into JDK-8294438-misleading-indentation - Merge branch 'master' into JDK-8294438-misleading-indentation - Merge branch 'master' into JDK-8294438-misleading-indentation - Also javaClasses.cpp - Fix ------------- Changes: https://git.openjdk.org/jdk/pull/10444/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10444&range=02 Stats: 56 lines in 5 files changed: 7 ins; 20 del; 29 mod Patch: https://git.openjdk.org/jdk/pull/10444.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10444/head:pull/10444 PR: https://git.openjdk.org/jdk/pull/10444 From shade at openjdk.org Thu Oct 20 07:16:55 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:16:55 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: Message-ID: On Wed, 19 Oct 2022 19:11:42 GMT, Aleksey Shipilev wrote: >> There are number of places where misleading-indentation is reported by GCC. Currently, the warning is disabled for the entirety of Hotspot, which is not good. >> >> C1 does an unusual style here. Changing it globally would touch a lot of lines. Instead of doing that, I fit the existing style while also resolving the warnings. Note this actually solves a bug in `lir_alloc_array`, where `do_temp` are called without a check. >> >> Build-tested this with product of: >> - GCC 10 >> - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} >> - {server, zero} >> - {release, fastdebug} >> >> Linux x86_64 fastdebug `tier1` is fine. > > Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: > > - Merge branch 'master' into JDK-8294438-misleading-indentation > - Merge branch 'master' into JDK-8294438-misleading-indentation > - Merge branch 'master' into JDK-8294438-misleading-indentation > - Also javaClasses.cpp > - Fix Thanks! ------------- PR: https://git.openjdk.org/jdk/pull/10444 From shade at openjdk.org Thu Oct 20 07:21:03 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:21:03 GMT Subject: Integrated: 8294438: Fix misleading-indentation warnings in hotspot In-Reply-To: References: Message-ID: On Tue, 27 Sep 2022 10:28:54 GMT, Aleksey Shipilev wrote: > There are number of places where misleading-indentation is reported by GCC. Currently, the warning is disabled for the entirety of Hotspot, which is not good. > > C1 does an unusual style here. 
Changing it globally would touch a lot of lines. Instead of doing that, I fit the existing style while also resolving the warnings. Note this actually solves a bug in `lir_alloc_array`, where `do_temp` are called without a check. > > Build-tested this with product of: > - GCC 10 > - {i686, x86_64, aarch64, powerpc64le, s390x, armhf, riscv64} > - {server, zero} > - {release, fastdebug} > > Linux x86_64 fastdebug `tier1` is fine. This pull request has now been integrated. Changeset: 545021b1 Author: Aleksey Shipilev URL: https://git.openjdk.org/jdk/commit/545021b18d6f82ac8013009939ef4e05b8ebf7ce Stats: 56 lines in 5 files changed: 7 ins; 20 del; 29 mod 8294438: Fix misleading-indentation warnings in hotspot Reviewed-by: ihse, dholmes, coleenp ------------- PR: https://git.openjdk.org/jdk/pull/10444 From dholmes at openjdk.org Thu Oct 20 07:34:00 2022 From: dholmes at openjdk.org (David Holmes) Date: Thu, 20 Oct 2022 07:34:00 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: Message-ID: <_mUxIObAiNsVdxaC637-8aaIwXyHFc5xaBGu_phRn_0=.b5cf975b-3649-4ad2-82d6-8de11ef09bf8@github.com> On Thu, 20 Oct 2022 07:14:27 GMT, Aleksey Shipilev wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'master' into JDK-8294438-misleading-indentation >> - Merge branch 'master' into JDK-8294438-misleading-indentation >> - Merge branch 'master' into JDK-8294438-misleading-indentation >> - Also javaClasses.cpp >> - Fix > > Thanks! @shipilev this has broken our linux aarch64 builds! [2022-10-20T07:26:59,542Z] workspace/open/src/hotspot/cpu/aarch64/assembler_aarch64.cpp: In member function 'void Address::lea(MacroAssembler*, Register) const': [2022-10-20T07:26:59,542Z] workspace/open/src/hotspot/cpu/aarch64/assembler_aarch64.cpp:138:5: error: this 'else' clause does not guard... [-Werror=misleading-indentation] [2022-10-20T07:26:59,542Z] 138 | else [2022-10-20T07:26:59,542Z] | ^~~~ [2022-10-20T07:26:59,542Z] workspace/open/src/hotspot/cpu/aarch64/assembler_aarch64.cpp:140:7: note: ...this statement, but the latter is misleadingly indented as if it were guarded by the 'else' [2022-10-20T07:26:59,542Z] 140 | break; [2022-10-20T07:26:59,542Z] | ^~~~~ ------------- PR: https://git.openjdk.org/jdk/pull/10444 From shade at openjdk.org Thu Oct 20 07:37:13 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:37:13 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 07:14:27 GMT, Aleksey Shipilev wrote: >> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains five commits: >> >> - Merge branch 'master' into JDK-8294438-misleading-indentation >> - Merge branch 'master' into JDK-8294438-misleading-indentation >> - Merge branch 'master' into JDK-8294438-misleading-indentation >> - Also javaClasses.cpp >> - Fix > > Thanks! > @shipilev this has broken our linux aarch64 builds! Whoa. Looking. 
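For context, the pattern GCC rejects in the log above has this general shape -- a schematic example, not the actual `Address::lea` code in assembler_aarch64.cpp:

```c++
// Without braces only the first statement after 'else' is guarded, but the
// 'break' below is indented as if it were part of the else-branch, which is
// exactly what -Wmisleading-indentation (an error under -Werror) complains about.
int classify(int mode, long offset) {
  int result = 0;
  switch (mode) {
  case 1:
    if (offset == 0)
      result = 1;
    else
      result = 2;
      break;   // flagged: "misleadingly indented as if it were guarded by the 'else'"
  default:
    result = -1;
  }
  return result;
}
```

The code behaves the same either way (the `break` runs unconditionally), but with `-Werror` the indentation alone fails the build; adding braces or re-indenting the `break` is the usual way to silence the warning.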
------------- PR: https://git.openjdk.org/jdk/pull/10444 From shade at openjdk.org Thu Oct 20 07:44:03 2022 From: shade at openjdk.org (Aleksey Shipilev) Date: Thu, 20 Oct 2022 07:44:03 GMT Subject: RFR: 8294438: Fix misleading-indentation warnings in hotspot [v3] In-Reply-To: References: Message-ID: On Thu, 20 Oct 2022 07:34:29 GMT, Aleksey Shipilev wrote: > > @shipilev this has broken our linux aarch64 builds! > > Whoa. Looking. That would be: #10781 ------------- PR: https://git.openjdk.org/jdk/pull/10444 From jsjolen at openjdk.org Fri Oct 21 09:58:33 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 21 Oct 2022 09:58:33 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream In-Reply-To: References: Message-ID: On Fri, 7 Oct 2022 11:19:55 GMT, Johan Sj?len wrote: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. I put back the ResourceMark in `VM_Operation::evaluate` as looking through each VM Operation for unprotected resource usage is infeasible. ------------- PR: https://git.openjdk.org/jdk/pull/10602 From jsjolen at openjdk.org Fri Oct 21 09:58:32 2022 From: jsjolen at openjdk.org (Johan =?UTF-8?B?U2rDtmxlbg==?=) Date: Fri, 21 Oct 2022 09:58:32 GMT Subject: RFR: 8294954: Remove superfluous ResourceMarks when using LogStream [v2] In-Reply-To: References: Message-ID: > Hi, > > I went through all of the places where LogStreams are created and removed the unnecessary ResourceMarks. I also added a ResourceMark in one place, where it was needed because of a call to `::name_and_sig_as_C_string` and moved one to the smallest scope where it is used. Johan Sj?len has updated the pull request incrementally with one additional commit since the last revision: Put back VM_Operation::evaluate ResourceMark ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10602/files - new: https://git.openjdk.org/jdk/pull/10602/files/bfa88acb..ab939bf8 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10602&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10602&range=00-01 Stats: 1 line in 1 file changed: 1 ins; 0 del; 0 mod Patch: https://git.openjdk.org/jdk/pull/10602.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10602/head:pull/10602 PR: https://git.openjdk.org/jdk/pull/10602 From rkennke at openjdk.org Mon Oct 24 08:03:13 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 24 Oct 2022 08:03:13 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v7] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). 
The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. 
All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? > > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) > - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: - Merge remote-tracking branch 'upstream/master' into fast-locking - More RISC-V fixes - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - RISC-V port - Revert "Re-use r0 in call to unlock_object()" This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. 
- Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Fix number of rt args to complete_monitor_locking_C, remove some comments - Re-use r0 in call to unlock_object() - Merge tag 'jdk-20+17' into fast-locking Added tag jdk-20+17 for changeset 79ccc791 - Fix OSR packing in AArch64, part 2 - ... and 25 more: https://git.openjdk.org/jdk/compare/65c84e0c...a67eb95e ------------- Changes: https://git.openjdk.org/jdk/pull/10590/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=06 Stats: 4031 lines in 137 files changed: 731 ins; 2703 del; 597 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Mon Oct 24 11:04:16 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Mon, 24 Oct 2022 11:04:16 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v7] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 08:03:13 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. 
What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
>> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
>> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) >> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: > > - Merge remote-tracking branch 'upstream/master' into fast-locking > - More RISC-V fixes > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - RISC-V port > - Revert "Re-use r0 in call to unlock_object()" > > This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. 
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Fix number of rt args to complete_monitor_locking_C, remove some comments > - Re-use r0 in call to unlock_object() > - Merge tag 'jdk-20+17' into fast-locking > > Added tag jdk-20+17 for changeset 79ccc791 > - Fix OSR packing in AArch64, part 2 > - ... and 25 more: https://git.openjdk.org/jdk/compare/65c84e0c...a67eb95e First, the "SharedRuntime::complete_monitor_locking_C" crash does not reproduce. Secondly, a question/suggestion: many recursive cases do not interleave locks, meaning the recursive enter will happen with that lock/oop already at the top of the lock stack. Why not peek at the top lock/oop in the lock-stack: if it is the current one, just push it again and the locking is done (instead of inflating)? The exit would then need to check whether this is the last one and do a proper exit only then. Are you worried about the size of the lock-stack? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From wkemper at openjdk.org Mon Oct 24 21:20:05 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 24 Oct 2022 21:20:05 GMT Subject: RFR: Merge openjdk/jdk:master Message-ID: This tag included a change to the `CardTable` (base class for `ShenandoahCardTable`) that required a couple of commits to fix up. ------------- Commit messages: - Do not search past last valid index for objects in clean cards - Remove '-1' for '+1' removed in JDK-8292912 - Merge tag 'jdk-20+18' into upstream-merge-test - 8294869: Correct failure of RemovedJDKInternals.java after JDK-8294618 - 8294397: Replace StringBuffer with StringBuilder within java.text - 8294734: Redundant override in AES implementation - 8294618: Update openjdk.java.net => openjdk.org - 8294840: langtools OptionalDependencyTest.java use File.pathSeparator - 8289925: Shared code shouldn't reference the platform specific method frame::interpreter_frame_last_sp() - 8282900: runtime/stringtable/StringTableCleaningTest.java verify unavailable at this moment - ... and 504 more: https://git.openjdk.org/shenandoah/compare/a3799c8d...1b8006a7 The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=shenandoah&pr=163&range=00.0 - openjdk/jdk:master: https://webrevs.openjdk.org/?repo=shenandoah&pr=163&range=00.1 Changes: https://git.openjdk.org/shenandoah/pull/163/files Stats: 122056 lines in 2353 files changed: 58668 ins; 49130 del; 14258 mod Patch: https://git.openjdk.org/shenandoah/pull/163.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/163/head:pull/163 PR: https://git.openjdk.org/shenandoah/pull/163 From wkemper at openjdk.org Mon Oct 24 22:38:33 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 24 Oct 2022 22:38:33 GMT Subject: RFR: Merge openjdk/jdk:master [v2] In-Reply-To: References: Message-ID: > This tag included a change to the `CardTable` (base class for `ShenandoahCardTable`) that required a couple of commits to fix up. William Kemper has updated the pull request with a new target base due to a merge or a rebase.
The pull request now contains 187 commits: - Do not search past last valid index for objects in clean cards - Remove '-1' for '+1' removed in JDK-8292912 - Merge tag 'jdk-20+18' into upstream-merge-test Added tag jdk-20+18 for changeset 0ec18382b - Resolve merge issues - Merge branch 'shenandoah-master' into upstream-merge-test - Shenandoah unified logging Reviewed-by: wkemper, shade - Fix off-by-one error when verifying object registrations Reviewed-by: kdnilsen - Merge openjdk/jdk:master - Log rotation Reviewed-by: wkemper - Use only up to ConcGCThreads for concurrent RS scanning. Reviewed-by: kdnilsen, wkemper - ... and 177 more: https://git.openjdk.org/shenandoah/compare/0ec18382...1b8006a7 ------------- Changes: https://git.openjdk.org/shenandoah/pull/163/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=163&range=01 Stats: 14378 lines in 142 files changed: 13158 ins; 486 del; 734 mod Patch: https://git.openjdk.org/shenandoah/pull/163.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/163/head:pull/163 PR: https://git.openjdk.org/shenandoah/pull/163 From wkemper at openjdk.org Mon Oct 24 22:38:35 2022 From: wkemper at openjdk.org (William Kemper) Date: Mon, 24 Oct 2022 22:38:35 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 21:04:53 GMT, William Kemper wrote: > This tag included a change to the `CardTable` (base class for `ShenandoahCardTable`) that required a couple of commits to fix up. This pull request has now been integrated. Changeset: 79a4bd18 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/79a4bd18f5b21a17818b4083bd78c890e93ff09a Stats: 122056 lines in 2353 files changed: 58668 ins; 49130 del; 14258 mod Merge openjdk/jdk:master ------------- PR: https://git.openjdk.org/shenandoah/pull/163 From wkemper at openjdk.org Tue Oct 25 16:04:17 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 25 Oct 2022 16:04:17 GMT Subject: RFR: Merge openjdk/jdk:master Message-ID: <00hiPXNi9cTaqcd-zmuj1fkkncRaKPRJu2DRgDT8Okc=.f4a8fcbe-5ad9-4135-ad9e-956f48b8ce52@github.com> Merge tag jdk-20+20, minor conflict in shenandoahControlThread.cpp. Looks good in test pipelines. ------------- Commit messages: - Merge branch 'upstream-merge-test' into merge-jdk20-20 - Merge jdk-20+20 - 8294467: Fix sequence-point warnings in Hotspot - 8294468: Fix char-subscripts warnings in Hotspot - 8295662: jdk/incubator/vector tests fail "assert(VM_Version::supports_avx512vlbw()) failed" - 8295668: validate-source failure after JDK-8290011 - 8295372: CompactNumberFormat handling of number one with decimal part - 8295456: (ch) sun.nio.ch.Util::checkBufferPositionAligned gives misleading/incorrect error - 8290011: IGV: Remove dead code and cleanup - 8290368: Introduce LDAP and RMI protocol-specific object factory filters to JNDI implementation - ... 
and 205 more: https://git.openjdk.org/shenandoah/compare/79a4bd18...8fc3ccdf The webrevs contain the adjustments done while merging with regards to each parent branch: - master: https://webrevs.openjdk.org/?repo=shenandoah&pr=164&range=00.0 - openjdk/jdk:master: https://webrevs.openjdk.org/?repo=shenandoah&pr=164&range=00.1 Changes: https://git.openjdk.org/shenandoah/pull/164/files Stats: 31261 lines in 1203 files changed: 18280 ins; 8098 del; 4883 mod Patch: https://git.openjdk.org/shenandoah/pull/164.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/164/head:pull/164 PR: https://git.openjdk.org/shenandoah/pull/164 From wkemper at openjdk.org Tue Oct 25 16:05:56 2022 From: wkemper at openjdk.org (William Kemper) Date: Tue, 25 Oct 2022 16:05:56 GMT Subject: Integrated: Merge openjdk/jdk:master In-Reply-To: <00hiPXNi9cTaqcd-zmuj1fkkncRaKPRJu2DRgDT8Okc=.f4a8fcbe-5ad9-4135-ad9e-956f48b8ce52@github.com> References: <00hiPXNi9cTaqcd-zmuj1fkkncRaKPRJu2DRgDT8Okc=.f4a8fcbe-5ad9-4135-ad9e-956f48b8ce52@github.com> Message-ID: On Tue, 25 Oct 2022 15:57:36 GMT, William Kemper wrote: > Merge tag jdk-20+20, minor conflict in shenandoahControlThread.cpp. Looks good in test pipelines. This pull request has now been integrated. Changeset: 067173a6 Author: William Kemper URL: https://git.openjdk.org/shenandoah/commit/067173a6dafff7341af9116152ddd861aa825dc7 Stats: 31261 lines in 1203 files changed: 18280 ins; 8098 del; 4883 mod Merge openjdk/jdk:master ------------- PR: https://git.openjdk.org/shenandoah/pull/164 From iveresov at openjdk.org Tue Oct 25 20:00:26 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Tue, 25 Oct 2022 20:00:26 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 Message-ID: The fix does two things: 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. Testing is clean, Valhalla testing is clean too. ------------- Commit messages: - Add test - Fix scalarization - Allow direct constant folding Changes: https://git.openjdk.org/jdk/pull/10861/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10861&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295066 Stats: 260 lines in 9 files changed: 178 ins; 46 del; 36 mod Patch: https://git.openjdk.org/jdk/pull/10861.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10861/head:pull/10861 PR: https://git.openjdk.org/jdk/pull/10861 From kvn at openjdk.org Tue Oct 25 22:06:17 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Tue, 25 Oct 2022 22:06:17 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. Looks good. Please, test full first 3 tier1-3 (not just hs-tier*). ------------- Marked as reviewed by kvn (Reviewer). 
PR: https://git.openjdk.org/jdk/pull/10861 From iveresov at openjdk.org Wed Oct 26 04:19:23 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 26 Oct 2022 04:19:23 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> References: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> Message-ID: On Tue, 25 Oct 2022 22:02:54 GMT, Vladimir Kozlov wrote: > Please, test full first 3 tier1-3 (not just hs-tier*). Done. Looks good. ------------- PR: https://git.openjdk.org/jdk/pull/10861 From kvn at openjdk.org Wed Oct 26 04:47:24 2022 From: kvn at openjdk.org (Vladimir Kozlov) Date: Wed, 26 Oct 2022 04:47:24 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: <5VWY6hlnoGyt8nqJMnX14qp7bpCvm4G1enchLM6NGT8=.f3a1b91d-12fb-4422-99ff-cc0dcbf669c5@github.com> Message-ID: On Wed, 26 Oct 2022 04:15:39 GMT, Igor Veresov wrote: > > Please, test full first 3 tier1-3 (not just hs-tier*). > > Done. Looks good. Thank you for running them. ------------- PR: https://git.openjdk.org/jdk/pull/10861 From thartmann at openjdk.org Wed Oct 26 05:24:23 2022 From: thartmann at openjdk.org (Tobias Hartmann) Date: Wed, 26 Oct 2022 05:24:23 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: <8rFROVmvN4pO0mGVlXs48VNkJ1c0D7UpiBarJIz7QJg=.31693a6d-f908-473b-bedb-f7cd824efb63@github.com> On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. That looks good to me too. ------------- Marked as reviewed by thartmann (Reviewer). PR: https://git.openjdk.org/jdk/pull/10861 From iveresov at openjdk.org Wed Oct 26 20:49:33 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 26 Oct 2022 20:49:33 GMT Subject: RFR: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: <3judUFx-evWUXwoahXsErqBiA8XwbQpBLuJQU4HqnSE=.a7e1d5e8-e18c-4d1e-9d8b-f69bfc6f045e@github.com> On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. Thanks for the reviews! ------------- PR: https://git.openjdk.org/jdk/pull/10861 From iveresov at openjdk.org Wed Oct 26 20:49:34 2022 From: iveresov at openjdk.org (Igor Veresov) Date: Wed, 26 Oct 2022 20:49:34 GMT Subject: Integrated: 8295066: Folding of loads is broken in C2 after JDK-8242115 In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 19:50:10 GMT, Igor Veresov wrote: > The fix does two things: > > 1. Allow folding of pinned loads to constants with a straight line data flow (no phis). > 2. Make scalarization aware of the new shape of the barriers so that pre-loads can be ignored. > > Testing is clean, Valhalla testing is clean too. This pull request has now been integrated. 
Changeset: 58a7141a Author: Igor Veresov URL: https://git.openjdk.org/jdk/commit/58a7141a0dea5d1b4bfe6d56a95d860c854b3461 Stats: 260 lines in 9 files changed: 178 ins; 46 del; 36 mod 8295066: Folding of loads is broken in C2 after JDK-8242115 Reviewed-by: kvn, thartmann ------------- PR: https://git.openjdk.org/jdk/pull/10861 From dcubed at openjdk.org Thu Oct 27 19:57:37 2022 From: dcubed at openjdk.org (Daniel D. Daugherty) Date: Thu, 27 Oct 2022 19:57:37 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v7] In-Reply-To: References: Message-ID: <_4pEeoarSDmRHeH3FOYwQz8RHokONrWwGdNIxv7Kpjo=.d82f866d-abd6-45b4-b7b0-9bd27a06294f@github.com> On Mon, 24 Oct 2022 08:03:13 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. 
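To make the lock-stack idea described above concrete, here is a minimal, hypothetical Java sketch of the per-thread structure: the class and method names are invented for illustration, and the real implementation is C++ inside HotSpot, which also performs the header CAS, overflow handling and inflation, none of which is modeled here.

```java
// Conceptual model only: each Java thread keeps a small array of object
// references ("oops") for the objects it has fast-locked.
final class LockStackSketch {
    private final Object[] stack = new Object[8]; // typically only 3-5 live entries
    private int top = 0;

    void pushAfterLock(Object lockee) {   // after a successful monitorenter fast path
        stack[top++] = lockee;
    }

    void popAfterUnlock(Object lockee) {  // monitorexit fast path; assumes balanced, nested locking
        assert top > 0 && stack[top - 1] == lockee;
        stack[--top] = null;
    }

    // The common query "does the current thread own this lock?" becomes a short
    // linear scan; queries like "which thread owns X?" would scan the lock stacks
    // of all threads, but only on slow paths (JVMTI, deadlock detection).
    boolean owns(Object lockee) {
        for (int i = 0; i < top; i++) {
            if (stack[i] == lockee) {
                return true;
            }
        }
        return false;
    }
}
```

Everything interesting (the CAS that flips the low header bits to 00, the fallback to a full monitor, the GC visiting these slots as roots) happens outside this sketch.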
>> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? >> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). 
All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. 
>> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) >> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 35 commits: > > - Merge remote-tracking branch 'upstream/master' into fast-locking > - More RISC-V fixes > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - RISC-V port > - Revert "Re-use r0 in call to unlock_object()" > > This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Fix number of rt args to complete_monitor_locking_C, remove some comments > - Re-use r0 in call to unlock_object() > - Merge tag 'jdk-20+17' into fast-locking > > Added tag jdk-20+17 for changeset 79ccc791 > - Fix OSR packing in AArch64, part 2 > - ... and 25 more: https://git.openjdk.org/jdk/compare/65c84e0c...a67eb95e This PR has been in "merge-conflict" state for about 10 days. When do you plan to merge again with the jdk/jdk repo? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From jrose at openjdk.org Thu Oct 27 20:41:44 2022 From: jrose at openjdk.org (John R Rose) Date: Thu, 27 Oct 2022 20:41:44 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v7] In-Reply-To: References: Message-ID: On Mon, 24 Oct 2022 11:01:01 GMT, Robbin Ehn wrote: > Secondly, a question/suggestion: Many recursive cases do not interleave locks, meaning the recursive enter will happen with the lock/oop top of lock stack already. Why not peak at top lock/oop in lock-stack if the is current just push it again and the locking is done? (instead of inflating) (exit would need to check if this is the last one and then proper exit) The CJM paper (Dice/Kogan 2021) mentions a "nesting" counter for this purpose. I suspect that a real counter is overkill, and the "unary" representation Robbin mentions would be fine, especially if there were a point (when the per-thread stack gets too big) at which we go and inflate anyway. The CJM paper suggests a full search of the per-thread array to detect the recursive condition, but again I like Robbin's idea of checking only the most recent lock record. So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. And there could be a depth limit as well. Any sequence of held locks not expressible within those limitations could go to inflation as a backup. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From mcimadamore at openjdk.org Thu Oct 27 21:00:07 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Thu, 27 Oct 2022 21:00:07 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) Message-ID: This PR contains the API and implementation changes for JEP-434 [1]. 
A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. [1] - https://openjdk.org/jeps/434 ------------- Commit messages: - Merge branch 'master' into PR_20 - Merge pull request #14 from minborg/small-javadoc - Update some javadocs - Revert some javadoc changes - Merge branch 'master' into PR_20 - Fix benchmark and test failure - Merge pull request #13 from minborg/revert-factories - Update javadocs after comments - Revert MemorySegment factories - Merge pull request #12 from minborg/fix-lookup-find - ... and 6 more: https://git.openjdk.org/jdk/compare/78454b69...ac7733da Changes: https://git.openjdk.org/jdk/pull/10872/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10872&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8295044 Stats: 10527 lines in 200 files changed: 4754 ins; 3539 del; 2234 mod Patch: https://git.openjdk.org/jdk/pull/10872.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10872/head:pull/10872 PR: https://git.openjdk.org/jdk/pull/10872 From mcimadamore at openjdk.org Thu Oct 27 21:00:07 2022 From: mcimadamore at openjdk.org (Maurizio Cimadamore) Date: Thu, 27 Oct 2022 21:00:07 GMT Subject: RFR: 8295044: Implementation of Foreign Function and Memory API (Second Preview) In-Reply-To: References: Message-ID: On Wed, 26 Oct 2022 13:11:50 GMT, Maurizio Cimadamore wrote: > This PR contains the API and implementation changes for JEP-434 [1]. A more detailed description of such changes, to avoid repetitions during the review process, is included as a separate comment. > > [1] - https://openjdk.org/jeps/434 Here are the main API changes introduced in this round (there are also some JVM changes which will be integrated separately): * The main change is the removal of `MemoryAddress` and `Addressable`. Instead, *zero-length memory segments* are used whenever the API needs to model "raw" addresses coming from native code. This simplifies the API, removing an ambiguous abstraction as well as some duplication in the API (see accessor methods in `MemoryAddress`); * To allow for "unsafe" access of zero-length memory segments, a new method has been added to `ValueLayout.OfAddress`, namely `asUnbounded`. This new restricted method takes an address layout and creates a new unbounded address layout. When using an unbounded layout to dereference memory, or construct downcall method handles, the API will create memory segments with maximal length (i.e. `Long.MAX_VALUE`, rather than zero-length memory segments, which can therefore be accessed; * The `MemoryLayout` hierarchy has been improved in several ways. First, the hierarchy is now defined in terms of sealed interfaces (intermediate abstract classes have been moved into the implementation package). The hierarchy is also exhaustive now, and works much better to pattern matching. More specifically, three new types have been added: `PaddingLayout`, `StructLayout` and `UnionLayout`, the latter two are a subtype of `GroupLayout`. Thanks to this move, several predicate methods (`isPadding`, `isStruct`, `isUnion`) have been dropped from the API; * The `SymbolLookup::lookup` method has been renamed to `SymbolLookup::find` - to avoid using the same word `lookup` in both noun and verb form, which leads to confusion; * A new method, on `ModuleLayer.Controller` has been added to enable native access on a module in a custom layer; * The new interface `Linker.Option` has been introduced. This is a tag interface accepted in `Linker::downcallHandle`. 
At the moment, only a single option is provided, to specify variadic function calls (because of this, the `FunctionDescriptor` interface has been simplified, and is now a simple carrier of arguments/return layouts). More linker options will follow. Javadoc: http://cr.openjdk.java.net/~mcimadamore/jdk/8295044/v1/javadoc/java.base/java/lang/foreign/package-summary.html ------------- PR: https://git.openjdk.org/jdk/pull/10872 From dholmes at openjdk.org Fri Oct 28 01:49:31 2022 From: dholmes at openjdk.org (David Holmes) Date: Fri, 28 Oct 2022 01:49:31 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v7] In-Reply-To: References: Message-ID: <6KaO6YDJAQZSps49h6TddX8-aXFEfOFCfLgpi1_90Ag=.d7fe0ac9-d392-4784-a13e-85f5212e00f1@github.com> On Thu, 27 Oct 2022 20:38:57 GMT, John R Rose wrote: > So the data structure for lock records (per thread) could consist of a series of distinct values [ A B C ] and each of the values could be repeated, but only adjacently: [ A A A B C C ] for example. @rose00 why only adjacently? Nested locking can be interleaved on different monitors. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From rehn at openjdk.org Fri Oct 28 06:35:07 2022 From: rehn at openjdk.org (Robbin Ehn) Date: Fri, 28 Oct 2022 06:35:07 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking In-Reply-To: <231161996.35475533.1666904633911.JavaMail.zimbra@u-pem.fr> References: <231161996.35475533.1666904633911.JavaMail.zimbra@u-pem.fr> Message-ID: On Fri, 28 Oct 2022 03:32:58 GMT, Remi Forax wrote: > i've some trouble to see how it can be implemented given that because of lock coarsening (+ may be OSR), the number of time a lock is held is different between the interpreted code and the compiled code. Correct me if I'm wrong, only C2 eliminates locks and C2 only compile if there is proper structured locking. This should mean that when we restore the eliminated locks in deopt we can inflate the recursive locks which are no longer interleaved and restructure the lock-stack accordingly. Is there another situation than deopt where it would matter? ------------- PR: https://git.openjdk.org/jdk/pull/10590 From forax at univ-mlv.fr Thu Oct 27 21:03:53 2022 From: forax at univ-mlv.fr (Remi Forax) Date: Thu, 27 Oct 2022 23:03:53 +0200 (CEST) Subject: RFR: 8291555: Replace stack-locking with fast-locking [v7] In-Reply-To: References: Message-ID: <231161996.35475533.1666904633911.JavaMail.zimbra@u-pem.fr> ----- Original Message ----- > From: "John R Rose" > To: hotspot-dev at openjdk.org, serviceability-dev at openjdk.org, shenandoah-dev at openjdk.org > Sent: Thursday, October 27, 2022 10:41:44 PM > Subject: Re: RFR: 8291555: Replace stack-locking with fast-locking [v7] > On Mon, 24 Oct 2022 11:01:01 GMT, Robbin Ehn wrote: > >> Secondly, a question/suggestion: Many recursive cases do not interleave locks, >> meaning the recursive enter will happen with the lock/oop top of lock stack >> already. Why not peak at top lock/oop in lock-stack if the is current just push >> it again and the locking is done? (instead of inflating) (exit would need to >> check if this is the last one and then proper exit) > > The CJM paper (Dice/Kogan 2021) mentions a "nesting" counter for this purpose. > I suspect that a real counter is overkill, and the "unary" representation > Robbin mentions would be fine, especially if there were a point (when the > per-thread stack gets too big) at which we go and inflate anyway. 
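As a rough Java illustration of the scheme being discussed just above (only the most recent lock-stack entry is checked, so recursive acquisitions show up as adjacent repetitions such as [ A A A B C C ]): the names and the inflate fallback are invented for this sketch, and it is not what the PR currently implements.

```java
// Illustrative only: recursion support by repeating the top lock-stack entry.
final class RecursiveEnterSketch {
    private final Object[] stack = new Object[8];
    private int top = 0;

    boolean tryEnter(Object lockee) {
        if (top == stack.length) {
            return false;                  // depth limit reached: caller inflates to a full monitor
        }
        if (top > 0 && stack[top - 1] == lockee) {
            stack[top++] = lockee;         // recursive enter: same object already on top
            return true;
        }
        // Non-recursive enter: the real code would CAS the object header here;
        // interleaved recursion (lockee deeper in the stack) would fail that CAS
        // and also fall back to inflation.
        stack[top++] = lockee;
        return true;
    }

    void exit(Object lockee) {
        assert top > 0 && stack[top - 1] == lockee;
        stack[--top] = null;               // if the same object remains below, the
                                           // lock is still held (recursive exit)
    }
}
```

Whether the unary form above or a real nesting counter is used, the point is that the exit check stays a single comparison against the top entry.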
> > The CJM paper suggests a full search of the per-thread array to detect the > recursive condition, but again I like Robbin's idea of checking only the most > recent lock record. > > So the data structure for lock records (per thread) could consist of a series of > distinct values [ A B C ] and each of the values could be repeated, but only > adjacently: [ A A A B C C ] for example. And there could be a depth limit as > well. Any sequence of held locks not expressible within those limitations > could go to inflation as a backup. Hi John, a certainly stupid question, i've some trouble to see how it can be implemented given that because of lock coarsening (+ may be OSR), the number of time a lock is held is different between the interpreted code and the compiled code. R?mi > > ------------- > > PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 28 09:32:58 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 28 Oct 2022 09:32:58 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v8] In-Reply-To: References: Message-ID: > This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. > > What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. > > This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. > > In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. > > One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. 
What we do instead is to record a special marker ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. > > As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. > > This change enables to simplify (and speed-up!) a lot of code: > > - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. > - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR > > ### Benchmarks > > All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. > > #### DaCapo/AArch64 > > Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
> > benchmark | baseline | fast-locking | % | size > -- | -- | -- | -- | -- > avrora | 27859 | 27563 | 1.07% | large > batik | 20786 | 20847 | -0.29% | large > biojava | 27421 | 27334 | 0.32% | default > eclipse | 59918 | 60522 | -1.00% | large > fop | 3670 | 3678 | -0.22% | default > graphchi | 2088 | 2060 | 1.36% | default > h2 | 297391 | 291292 | 2.09% | huge > jme | 8762 | 8877 | -1.30% | default > jython | 18938 | 18878 | 0.32% | default > luindex | 1339 | 1325 | 1.06% | default > lusearch | 918 | 936 | -1.92% | default > pmd | 58291 | 58423 | -0.23% | large > sunflow | 32617 | 24961 | 30.67% | large > tomcat | 25481 | 25992 | -1.97% | large > tradebeans | 314640 | 311706 | 0.94% | huge > tradesoap | 107473 | 110246 | -2.52% | huge > xalan | 6047 | 5882 | 2.81% | default > zxing | 970 | 926 | 4.75% | default > > #### DaCapo/x86_64 > > The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. > > benchmark | baseline | fast-Locking | % | size > -- | -- | -- | -- | -- > avrora | 127690 | 126749 | 0.74% | large > batik | 12736 | 12641 | 0.75% | large > biojava | 15423 | 15404 | 0.12% | default > eclipse | 41174 | 41498 | -0.78% | large > fop | 2184 | 2172 | 0.55% | default > graphchi | 1579 | 1560 | 1.22% | default > h2 | 227614 | 230040 | -1.05% | huge > jme | 8591 | 8398 | 2.30% | default > jython | 13473 | 13356 | 0.88% | default > luindex | 824 | 813 | 1.35% | default > lusearch | 962 | 968 | -0.62% | default > pmd | 40827 | 39654 | 2.96% | large > sunflow | 53362 | 43475 | 22.74% | large > tomcat | 27549 | 28029 | -1.71% | large > tradebeans | 190757 | 190994 | -0.12% | huge > tradesoap | 68099 | 67934 | 0.24% | huge > xalan | 7969 | 8178 | -2.56% | default > zxing | 1176 | 1148 | 2.44% | default > > #### Renaissance/AArch64 > > This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
> > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 2558.832 | 2513.594 | 1.80% > Reactors | 14715.626 | 14311.246 | 2.83% > Als | 1851.485 | 1869.622 | -0.97% > ChiSquare | 1007.788 | 1003.165 | 0.46% > GaussMix | 1157.491 | 1149.969 | 0.65% > LogRegression | 717.772 | 733.576 | -2.15% > MovieLens | 7916.181 | 8002.226 | -1.08% > NaiveBayes | 395.296 | 386.611 | 2.25% > PageRank | 4294.939 | 4346.333 | -1.18% > FjKmeans | 496.076 | 493.873 | 0.45% > FutureGenetic | 2578.504 | 2589.255 | -0.42% > Mnemonics | 4898.886 | 4903.689 | -0.10% > ParMnemonics | 4260.507 | 4210.121 | 1.20% > Scrabble | 139.37 | 138.312 | 0.76% > RxScrabble | 320.114 | 322.651 | -0.79% > Dotty | 1056.543 | 1068.492 | -1.12% > ScalaDoku | 3443.117 | 3449.477 | -0.18% > ScalaKmeans | 259.384 | 258.648 | 0.28% > Philosophers | 24333.311 | 23438.22 | 3.82% > ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% > FinagleChirper | 6814.192 | 6853.38 | -0.57% > FinagleHttp | 4762.902 | 4807.564 | -0.93% > > #### Renaissance/x86_64 > > benchmark | baseline | fast-locking | % > -- | -- | -- | -- > AkkaUct | 1117.185 | 1116.425 | 0.07% > Reactors | 11561.354 | 11812.499 | -2.13% > Als | 1580.838 | 1575.318 | 0.35% > ChiSquare | 459.601 | 467.109 | -1.61% > GaussMix | 705.944 | 685.595 | 2.97% > LogRegression | 659.944 | 656.428 | 0.54% > MovieLens | 7434.303 | 7592.271 | -2.08% > NaiveBayes | 413.482 | 417.369 | -0.93% > PageRank | 3259.233 | 3276.589 | -0.53% > FjKmeans | 946.429 | 938.991 | 0.79% > FutureGenetic | 1760.672 | 1815.272 | -3.01% > ParMnemonics | 2016.917 | 2033.101 | -0.80% > Scrabble | 147.996 | 150.084 | -1.39% > RxScrabble | 177.755 | 177.956 | -0.11% > Dotty | 673.754 | 683.919 | -1.49% > ScalaDoku | 2193.562 | 1958.419 | 12.01% > ScalaKmeans | 165.376 | 168.925 | -2.10% > ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% > Philosophers | 14268.449 | 13308.87 | 7.21% > FinagleChirper | 4722.13 | 4688.3 | 0.72% > FinagleHttp | 3497.241 | 3605.118 | -2.99% > > Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. > > I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). > > Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. > > ### Testing > - [x] tier1 (x86_64, aarch64, x86_32) > - [x] tier2 (x86_64, aarch64) > - [x] tier3 (x86_64, aarch64) > - [x] tier4 (x86_64, aarch64) > - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: - Merge remote-tracking branch 'upstream/master' into fast-locking - Merge remote-tracking branch 'upstream/master' into fast-locking - Merge remote-tracking branch 'upstream/master' into fast-locking - More RISC-V fixes - Merge remote-tracking branch 'origin/fast-locking' into fast-locking - RISC-V port - Revert "Re-use r0 in call to unlock_object()" This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. 
- Merge remote-tracking branch 'origin/fast-locking' into fast-locking - Fix number of rt args to complete_monitor_locking_C, remove some comments - Re-use r0 in call to unlock_object() - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4 ------------- Changes: https://git.openjdk.org/jdk/pull/10590/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10590&range=07 Stats: 4031 lines in 137 files changed: 731 ins; 2703 del; 597 mod Patch: https://git.openjdk.org/jdk/pull/10590.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10590/head:pull/10590 PR: https://git.openjdk.org/jdk/pull/10590 From rkennke at openjdk.org Fri Oct 28 15:29:39 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Fri, 28 Oct 2022 15:29:39 GMT Subject: RFR: 8291555: Replace stack-locking with fast-locking [v8] In-Reply-To: References: Message-ID: <7ORZSjVcOQ8IrMAC0iS2pgsf_-vMKZQVmfjxAROqVq4=.267878cb-6392-428c-8a11-b431b2e19cfb@github.com> On Fri, 28 Oct 2022 09:32:58 GMT, Roman Kennke wrote: >> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overload of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check and deal with this situation. And because of the very racy nature, this turns out to be very complex and involved a variant of the inflation protocol to ensure that the object header is stable. >> >> What the original stack-locking does is basically to push a stack-lock onto the stack which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicate 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock. >> >> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typcially remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is very quickly answered by doing a quick scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols. >> >> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When that happens, the fast-lock gets inflated to a full monitor. It is not clear if it is worth to add support for recursive fast-locking. >> >> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is to record a special marker ANONYMOUS_OWNER. 
When the thread that currently holds the lock arrives at monitorexit, and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, and thus handing over to the contending thread. >> >> As an alternative, I considered to remove stack-locking altogether, and only use heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can lead to a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it. >> >> This change enables to simplify (and speed-up!) a lot of code: >> >> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header. >> - Accessing the hashcode could now be done in the fastpath always, if the hashcode has been installed. Fast-locked headers can be used directly, for monitor-locked objects we can easily reach-through to the displaced header. This is safe because Java threads participate in monitor deflation protocol. This would be implemented in a separate PR >> >> ### Benchmarks >> >> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better. >> >> #### DaCapo/AArch64 >> >> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It is using DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean what is it actually doing that a change in locking shows >30% perf difference? 
>> >> benchmark | baseline | fast-locking | % | size >> -- | -- | -- | -- | -- >> avrora | 27859 | 27563 | 1.07% | large >> batik | 20786 | 20847 | -0.29% | large >> biojava | 27421 | 27334 | 0.32% | default >> eclipse | 59918 | 60522 | -1.00% | large >> fop | 3670 | 3678 | -0.22% | default >> graphchi | 2088 | 2060 | 1.36% | default >> h2 | 297391 | 291292 | 2.09% | huge >> jme | 8762 | 8877 | -1.30% | default >> jython | 18938 | 18878 | 0.32% | default >> luindex | 1339 | 1325 | 1.06% | default >> lusearch | 918 | 936 | -1.92% | default >> pmd | 58291 | 58423 | -0.23% | large >> sunflow | 32617 | 24961 | 30.67% | large >> tomcat | 25481 | 25992 | -1.97% | large >> tradebeans | 314640 | 311706 | 0.94% | huge >> tradesoap | 107473 | 110246 | -2.52% | huge >> xalan | 6047 | 5882 | 2.81% | default >> zxing | 970 | 926 | 4.75% | default >> >> #### DaCapo/x86_64 >> >> The following measurements have been taken on an Intel Xeon Scalable Processors (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above. >> >> benchmark | baseline | fast-Locking | % | size >> -- | -- | -- | -- | -- >> avrora | 127690 | 126749 | 0.74% | large >> batik | 12736 | 12641 | 0.75% | large >> biojava | 15423 | 15404 | 0.12% | default >> eclipse | 41174 | 41498 | -0.78% | large >> fop | 2184 | 2172 | 0.55% | default >> graphchi | 1579 | 1560 | 1.22% | default >> h2 | 227614 | 230040 | -1.05% | huge >> jme | 8591 | 8398 | 2.30% | default >> jython | 13473 | 13356 | 0.88% | default >> luindex | 824 | 813 | 1.35% | default >> lusearch | 962 | 968 | -0.62% | default >> pmd | 40827 | 39654 | 2.96% | large >> sunflow | 53362 | 43475 | 22.74% | large >> tomcat | 27549 | 28029 | -1.71% | large >> tradebeans | 190757 | 190994 | -0.12% | huge >> tradesoap | 68099 | 67934 | 0.24% | huge >> xalan | 7969 | 8178 | -2.56% | default >> zxing | 1176 | 1148 | 2.44% | default >> >> #### Renaissance/AArch64 >> >> This tests Renaissance/JMH version 0.14.1 on same machines as DaCapo above, with same JVM settings. 
>> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 2558.832 | 2513.594 | 1.80% >> Reactors | 14715.626 | 14311.246 | 2.83% >> Als | 1851.485 | 1869.622 | -0.97% >> ChiSquare | 1007.788 | 1003.165 | 0.46% >> GaussMix | 1157.491 | 1149.969 | 0.65% >> LogRegression | 717.772 | 733.576 | -2.15% >> MovieLens | 7916.181 | 8002.226 | -1.08% >> NaiveBayes | 395.296 | 386.611 | 2.25% >> PageRank | 4294.939 | 4346.333 | -1.18% >> FjKmeans | 496.076 | 493.873 | 0.45% >> FutureGenetic | 2578.504 | 2589.255 | -0.42% >> Mnemonics | 4898.886 | 4903.689 | -0.10% >> ParMnemonics | 4260.507 | 4210.121 | 1.20% >> Scrabble | 139.37 | 138.312 | 0.76% >> RxScrabble | 320.114 | 322.651 | -0.79% >> Dotty | 1056.543 | 1068.492 | -1.12% >> ScalaDoku | 3443.117 | 3449.477 | -0.18% >> ScalaKmeans | 259.384 | 258.648 | 0.28% >> Philosophers | 24333.311 | 23438.22 | 3.82% >> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14% >> FinagleChirper | 6814.192 | 6853.38 | -0.57% >> FinagleHttp | 4762.902 | 4807.564 | -0.93% >> >> #### Renaissance/x86_64 >> >> benchmark | baseline | fast-locking | % >> -- | -- | -- | -- >> AkkaUct | 1117.185 | 1116.425 | 0.07% >> Reactors | 11561.354 | 11812.499 | -2.13% >> Als | 1580.838 | 1575.318 | 0.35% >> ChiSquare | 459.601 | 467.109 | -1.61% >> GaussMix | 705.944 | 685.595 | 2.97% >> LogRegression | 659.944 | 656.428 | 0.54% >> MovieLens | 7434.303 | 7592.271 | -2.08% >> NaiveBayes | 413.482 | 417.369 | -0.93% >> PageRank | 3259.233 | 3276.589 | -0.53% >> FjKmeans | 946.429 | 938.991 | 0.79% >> FutureGenetic | 1760.672 | 1815.272 | -3.01% >> ParMnemonics | 2016.917 | 2033.101 | -0.80% >> Scrabble | 147.996 | 150.084 | -1.39% >> RxScrabble | 177.755 | 177.956 | -0.11% >> Dotty | 673.754 | 683.919 | -1.49% >> ScalaDoku | 2193.562 | 1958.419 | 12.01% >> ScalaKmeans | 165.376 | 168.925 | -2.10% >> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95% >> Philosophers | 14268.449 | 13308.87 | 7.21% >> FinagleChirper | 4722.13 | 4688.3 | 0.72% >> FinagleHttp | 3497.241 | 3605.118 | -2.99% >> >> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and probably addressing with running them much more often. >> >> I have also run another benchmark, which is a popular Java JVM benchmark, with workloads wrapped in JMH and very slightly modified to run with newer JDKs, but I won't publish the results because I am not sure about the licensing terms. They look similar to the measurements above (i.e. +/- 2%, nothing very suspicious). >> >> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here. >> >> ### Testing >> - [x] tier1 (x86_64, aarch64, x86_32) >> - [x] tier2 (x86_64, aarch64) >> - [x] tier3 (x86_64, aarch64) >> - [x] tier4 (x86_64, aarch64) >> - [x] jcstress 3-days -t sync -af GLOBAL (x86_64, aarch64) > > Roman Kennke has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 37 commits: > > - Merge remote-tracking branch 'upstream/master' into fast-locking > - Merge remote-tracking branch 'upstream/master' into fast-locking > - Merge remote-tracking branch 'upstream/master' into fast-locking > - More RISC-V fixes > - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - RISC-V port > - Revert "Re-use r0 in call to unlock_object()" > > This reverts commit ebbcb615a788998596f403b47b72cf133cb9de46. 
> - Merge remote-tracking branch 'origin/fast-locking' into fast-locking > - Fix number of rt args to complete_monitor_locking_C, remove some comments > - Re-use r0 in call to unlock_object() > - ... and 27 more: https://git.openjdk.org/jdk/compare/4b89fce0...3f0acba4 FYI: I am working on an alternative PR for this that makes fast-locking optional and opt-in behind an experimental switch. It will also be much less invasive (no structural changes except absolutely necessary, no cleanups) and thus easier to handle. ------------- PR: https://git.openjdk.org/jdk/pull/10590 From redestad at openjdk.org Fri Oct 28 20:48:10 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 28 Oct 2022 20:48:10 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops Message-ID: Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. With the most recent fixes the x64 intrinsic results on my workstation look like this: Benchmark (size) Mode Cnt Score Error Units StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op Baseline: Benchmark (size) Mode Cnt Score Error Units StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op I.e. no measurable overhead compared to baseline even for `size == 1`. The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. Benchmark for `Arrays.hashCode`: Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op ArraysHashCode.shorts 100 avgt 5 87.095 ? 
0.417 ns/op ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op Baseline: Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. ------------- Commit messages: - ws - Add ArraysHashCode microbenchmarks - Fixed vector loops for int and char arrays - Split up Arrays/HashCode tests - Fixes, optimized short inputs, temporarily disabled vector loop for Arrays.hashCode cases, added and improved tests - typo - Add Arrays.hashCode tests, enable intrinsic by default on x86 - Correct start values for array hashCode methods - Merge branch 'master' into 8282664-polyhash - Fold identical ops; only add coef expansion for Arrays cases - ... and 28 more: https://git.openjdk.org/jdk/compare/303548ba...22fec5f0 Changes: https://git.openjdk.org/jdk/pull/10847/files Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=00 Issue: https://bugs.openjdk.org/browse/JDK-8282664 Stats: 1129 lines in 32 files changed: 1071 ins; 32 del; 26 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From luhenry at openjdk.org Fri Oct 28 20:48:10 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Fri, 28 Oct 2022 20:48:10 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad wrote: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
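For readers who want to see the shape of the transformation without opening the gist, here is a small stand-alone Java sketch of a hand-unrolled polynomial hash of the kind discussed in this PR, with the coefficient fixed at 31 and a zero start value as for `String`; the method name and the 4-way unroll factor are illustrative choices, not the actual library code.

```java
// Equivalent to: for (char c : a) h = 31 * h + c; but with four elements per
// iteration, so the four products are independent and can be evaluated in parallel.
static int unrolledHash(char[] a) {
    int h = 0;                              // String-style start value; Arrays.hashCode starts at 1
    int i = 0;
    for (; i + 3 < a.length; i += 4) {
        h = h * 923521                      // 31^4
          + a[i]     * 29791                // 31^3
          + a[i + 1] * 961                  // 31^2
          + a[i + 2] * 31
          + a[i + 3];
    }
    for (; i < a.length; i++) {             // leftover tail, at most three elements
        h = 31 * h + a[i];
    }
    return h;
}
```

Since two's-complement overflow wraps the same way in both forms, the result matches the straight loop exactly; roughly speaking, the intrinsic then widens the same idea to vector registers with more lanes and larger precomputed powers of 31, which is where the large-input speedups for `char`/`int` come from.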
> > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. I did a quick write up explaining the approach at https://gist.github.com/luhenry/2fc408be6f906ef79aaf4115525b9d0c. 
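To sketch the idea from that write-up (illustrative only -- the method below is not the code in the patch): the classic `h = 31 * h + x[i]` loop is unrolled so that the per-element multiplications become independent of one another, e.g. for a Latin-1 `byte[]`:

    // Illustrative sketch of the unrolling idea; names and stride are made up here.
    static int hashLatin1Unrolled(byte[] a) {
        int h = 0;
        int i = 0;
        // 31^4 = 923521, 31^3 = 29791, 31^2 = 961. The four element products are
        // independent, so the loop-carried chain shrinks to one multiply-add per step.
        for (; i + 3 < a.length; i += 4) {
            h = h * 923521
                + (a[i]     & 0xff) * 29791
                + (a[i + 1] & 0xff) * 961
                + (a[i + 2] & 0xff) * 31
                + (a[i + 3] & 0xff);
        }
        // Remaining tail elements keep the classic form.
        for (; i < a.length; i++) {
            h = 31 * h + (a[i] & 0xff);
        }
        return h;
    }

The vectorized variant applies the same identity lane-wise, with a single reduction at the end.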
Also, you can find details in @richardstartin's [blog post](https://richardstartin.github.io/posts/vectorised-polynomial-hash-codes) ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Fri Oct 28 20:48:12 2022 From: redestad at openjdk.org (Claes Redestad) Date: Fri, 28 Oct 2022 20:48:12 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: <2URA7qiWBkx-l9U0FfNIBNOVyDeToiv8x0fmhHKhGOs=.edad5b57-0986-41ca-83f1-256021f5ec11@github.com> On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad wrote: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 
50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. While there are some incompleteness (no vectorization of byte and short arrays) I think this is ready to begin reviewing now. Implementing vectorization properly for byte and short arrays can be done as a follow-up, or someone might now a way to sign-extend subword integers properly that fits easily into the intrinsic implementation here. Porting to aarch64 and other platforms can be done as follow-ups and shouldn't block integration. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From duke at openjdk.org Sat Oct 29 09:28:25 2022 From: duke at openjdk.org (Piotr Tarsa) Date: Sat, 29 Oct 2022 09:28:25 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: <2URA7qiWBkx-l9U0FfNIBNOVyDeToiv8x0fmhHKhGOs=.edad5b57-0986-41ca-83f1-256021f5ec11@github.com> References: <2URA7qiWBkx-l9U0FfNIBNOVyDeToiv8x0fmhHKhGOs=.edad5b57-0986-41ca-83f1-256021f5ec11@github.com> Message-ID: On Fri, 28 Oct 2022 20:43:04 GMT, Claes Redestad wrote: > Porting to aarch64 and other platforms can be done as follow-ups and shouldn't block integration. I'm not an expert in JVM internals, but there's an already seemingly working String.hashCode intrinsification that's ISA independent: https://github.com/openjdk/jdk/pull/6658 It operates on higher level than direct assembly instructions, i.e. it operates on the ISA-independent vector nodes, so that all hardware platforms that support vectorization would get speedup (i.e. x86-64, x86-32, arm32, arm64, etc), therefore reducing manual work to get all of them working. I wonder why that pull request got no visible interest? Forgive me if I got something wrong :) ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sat Oct 29 10:38:08 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sat, 29 Oct 2022 10:38:08 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad wrote: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. 
To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 
0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Porting to aarch64 and other platforms can be done as follow-ups and shouldn't block integration. > > I'm not an expert in JVM internals, but there's an already seemingly working String.hashCode intrinsification that's ISA independent: #6658 It operates on higher level than direct assembly instructions, i.e. it operates on the ISA-independent vector nodes, so that all hardware platforms that support vectorization would get speedup (i.e. x86-64, x86-32, arm32, arm64, etc), therefore reducing manual work to get all of them working. I wonder why that pull request got no visible interest? > > Forgive me if I got something wrong :) I'll have to ask @merykitty why that patch was stalled. Never appeared on my radar until now -- thanks! The approach to use the library call kit API is promising since it avoids the need to port. And with similar results. I'll see if we can merge the approach here of having a shared intrinsic for `Arrays` and `String`, and bring in an ISA-independent backend implementation as in #6658 ------------- PR: https://git.openjdk.org/jdk/pull/10847 From qamai at openjdk.org Sat Oct 29 15:16:34 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Sat, 29 Oct 2022 15:16:34 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad wrote: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. 
no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. I am planning to submit that patch after finishing with the current under-reviewed PRs. That patch was stalled because there was no node for vectorised unsigned cast and constant values. The first one has been added and the second one may be worked around as in the PR. I also thought of using masked loads for tail processing instead of falling back to scalar implementation. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Sun Oct 30 19:21:28 2022 From: redestad at openjdk.org (Claes Redestad) Date: Sun, 30 Oct 2022 19:21:28 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: On Sat, 29 Oct 2022 15:11:56 GMT, Quan Anh Mai wrote: > I am planning to submit that patch after finishing with the current under-reviewed PRs. That patch was stalled because there was no node for vectorised unsigned cast and constant values. The first one has been added and the second one may be worked around as in the PR. 
I also thought of using masked loads for tail processing instead of falling back to scalar implementation. Ok, then I think we might as well move forward with this enhancement first. It'd establish some new tests, microbenchmarks as well as unifying the polynomial hash loops into a single intrinsic endpoint - while also putting back something that would be straightforward to backport (less dependencies on other recent enhancements). Then once the vector IR nodes have matured we can easily rip out the `VectorizedHashCodeNode` and replace it with such an implementation. WDYT? ------------- PR: https://git.openjdk.org/jdk/pull/10847 From qamai at openjdk.org Mon Oct 31 02:49:24 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 31 Oct 2022 02:49:24 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: References: Message-ID: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> On Tue, 25 Oct 2022 10:37:40 GMT, Claes Redestad wrote: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 
0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. I agree, please go ahead, I leave some comments for the x86 implementation. Thanks. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3358: > 3356: movl(result, is_string_hashcode ? 0 : 1); > 3357: > 3358: // if (cnt1 == 0) { You may want to reorder the execution of the loops, a short array suffers more from processing than a big array, so you should have minimum extra hops for those. For example, I think this could be: if (cnt1 >= 4) { if (cnt1 >= 16) { UNROLLED VECTOR LOOP SINGLE VECTOR LOOP } UNROLLED SCALAR LOOP } SINGLE SCALAR LOOP The thresholds are arbitrary and need to be measured carefully. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3374: > 3372: > 3373: // int i = 0; > 3374: movl(index, 0); `xorl(index, index)` src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3387: > 3385: for (int idx = 0; idx < 4; idx++) { > 3386: // h = (31 * h) or (h << 5 - h); > 3387: movl(tmp, result); If you are unrolling this, maybe break the dependency chain, `h = h * 31**4 + x[i] * 31**3 + x[i + 1] * 31**2 + x[i + 2] * 31 + x[i + 3]` src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3418: > 3416: // } else { // cnt1 >= 32 > 3417: address power_of_31_backwards = pc(); > 3418: emit_int32( 2111290369); Can this giant table be shared among compilations instead? src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3484: > 3482: decrementl(index); > 3483: jmpb(LONG_SCALAR_LOOP_BEGIN); > 3484: bind(LONG_SCALAR_LOOP_END); You can share this loop with the scalar ones above. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3493: > 3491: // vnext = IntVector.broadcast(I256, power_of_31_backwards[0]); > 3492: movdl(vnext, InternalAddress(power_of_31_backwards + (0 * sizeof(jint)))); > 3493: vpbroadcastd(vnext, vnext, Assembler::AVX_256bit); `vpbroadcastd` can take an `Address` argument instead. 
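As an aside on the constant-table question above: the emitted values look like decreasing powers of 31 in 32-bit int arithmetic, so a table of that shape could in principle be generated once and shared instead of being re-emitted for every compilation. A rough sketch with a hypothetical helper name, not the actual stub code:

    // Hypothetical helper: table[0] = 31^(n-1), ..., table[n-2] = 31, table[n-1] = 1.
    static int[] powersOf31Backwards(int n) {
        int[] table = new int[n];
        int p = 1;
        for (int k = n - 1; k >= 0; k--) {
            table[k] = p;
            p *= 31; // intentional int overflow; only the low 32 bits matter
        }
        return table;
    }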
src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3523: > 3521: subl(index, 32); > 3522: // i >= 0; > 3523: cmpl(index, 0); You don't need this since `subl` sets flags according to the result. src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3528: > 3526: vpmulld(vcoef[idx], vcoef[idx], vnext, Assembler::AVX_256bit); > 3527: } > 3528: jmp(LONG_VECTOR_LOOP_BEGIN); Calculating backward forces you to do calculating the coefficients on each iteration, I think doing this normally would be better. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From qamai at openjdk.org Mon Oct 31 02:49:25 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 31 Oct 2022 02:49:25 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops In-Reply-To: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 02:12:22 GMT, Quan Anh Mai wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ? 
6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3387: > >> 3385: for (int idx = 0; idx < 4; idx++) { >> 3386: // h = (31 * h) or (h << 5 - h); >> 3387: movl(tmp, result); > > If you are unrolling this, maybe break the dependency chain, `h = h * 31**4 + x[i] * 31**3 + x[i + 1] * 31**2 + x[i + 2] * 31 + x[i + 3]` A 256-bit vector is only 8 ints so this loop seems redundant, maybe running with the stride of 2 instead, in which case the single scalar calculation does also not need a loop. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 12:07:43 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 12:07:43 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. 
> > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. 
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Reorder loops and some other suggestions from @merykitty ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/22fec5f0..6aed1c1e Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=01 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=00-01 Stats: 110 lines in 1 file changed: 59 ins; 45 del; 6 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 12:28:10 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 12:28:10 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2] In-Reply-To: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 02:34:06 GMT, Quan Anh Mai wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Reorder loops and some other suggestions from @merykitty > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3484: > >> 3482: decrementl(index); >> 3483: jmpb(LONG_SCALAR_LOOP_BEGIN); >> 3484: bind(LONG_SCALAR_LOOP_END); > > You can share this loop with the scalar ones above. This might be messier than it first looks, since the two different loops use different temp registers based (long scalar can scratch cnt1, short scalar scratches the coef register). I'll have to think about this for a bit. > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3523: > >> 3521: subl(index, 32); >> 3522: // i >= 0; >> 3523: cmpl(index, 0); > > You don't need this since `subl` sets flags according to the result. Fixed ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 12:32:34 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 12:32:34 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2] In-Reply-To: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 02:21:44 GMT, Quan Anh Mai wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Reorder loops and some other suggestions from @merykitty > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3358: > >> 3356: movl(result, is_string_hashcode ? 0 : 1); >> 3357: >> 3358: // if (cnt1 == 0) { > > You may want to reorder the execution of the loops, a short array suffers more from processing than a big array, so you should have minimum extra hops for those. For example, I think this could be: > > if (cnt1 >= 4) { > if (cnt1 >= 16) { > UNROLLED VECTOR LOOP > SINGLE VECTOR LOOP > } > UNROLLED SCALAR LOOP > } > SINGLE SCALAR LOOP > > The thresholds are arbitrary and need to be measured carefully. 
Fixed > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3374: > >> 3372: >> 3373: // int i = 0; >> 3374: movl(index, 0); > > `xorl(index, index)` Fixed > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3418: > >> 3416: // } else { // cnt1 >= 32 >> 3417: address power_of_31_backwards = pc(); >> 3418: emit_int32( 2111290369); > > Can this giant table be shared among compilations instead? Probably, though I'm not entirely sure on how. Maybe the "long" cases should be factored out into a set of stub routines so that it's not inlined in numerous places anyway. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 12:32:34 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 12:32:34 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v2] In-Reply-To: References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 02:15:35 GMT, Quan Anh Mai wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3387: >> >>> 3385: for (int idx = 0; idx < 4; idx++) { >>> 3386: // h = (31 * h) or (h << 5 - h); >>> 3387: movl(tmp, result); >> >> If you are unrolling this, maybe break the dependency chain, `h = h * 31**4 + x[i] * 31**3 + x[i + 1] * 31**2 + x[i + 2] * 31 + x[i + 3]` > > A 256-bit vector is only 8 ints so this loop seems redundant, maybe running with the stride of 2 instead, in which case the single scalar calculation does also not need a loop. Working on this.. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 12:35:26 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 12:35:26 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v3] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 
0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ? 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ? 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ? 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ? 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ? 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ? 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ? 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ? 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ? 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ? 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ? 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ? 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ? 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ? 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ? 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ? 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ? 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ? 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ? 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ? 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ? 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ? 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ? 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ? 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ? 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. 
Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Require UseSSE >= 3 due transitive use of sse3 instructions from ReduceI ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/6aed1c1e..7e8a3e9c Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=02 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=01-02 Stats: 1 line in 1 file changed: 0 ins; 0 del; 1 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From luhenry at openjdk.org Mon Oct 31 13:22:32 2022 From: luhenry at openjdk.org (Ludovic Henry) Date: Mon, 31 Oct 2022 13:22:32 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v3] In-Reply-To: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 02:35:18 GMT, Quan Anh Mai wrote: >> Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: >> >> Require UseSSE >= 3 due transitive use of sse3 instructions from ReduceI > > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3493: > >> 3491: // vnext = IntVector.broadcast(I256, power_of_31_backwards[0]); >> 3492: movdl(vnext, InternalAddress(power_of_31_backwards + (0 * sizeof(jint)))); >> 3493: vpbroadcastd(vnext, vnext, Assembler::AVX_256bit); > > `vpbroadcastd` can take an `Address` argument instead. An `InternalAddress` isn't an `Address` but an `AddressLiteral`. You can however do `as_Address(InternalAddress(power_of_31_backwards + (0 * sizeof(jint))))` > src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3528: > >> 3526: vpmulld(vcoef[idx], vcoef[idx], vnext, Assembler::AVX_256bit); >> 3527: } >> 3528: jmp(LONG_VECTOR_LOOP_BEGIN); > > Calculating backward forces you to do calculating the coefficients on each iteration, I think doing this normally would be better. But doing it forward requires a `reduceLane` on each iteration. It's faster to do it backward. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From qamai at openjdk.org Mon Oct 31 13:38:47 2022 From: qamai at openjdk.org (Quan Anh Mai) Date: Mon, 31 Oct 2022 13:38:47 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v3] In-Reply-To: References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 13:18:35 GMT, Ludovic Henry wrote: >> src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 3528: >> >>> 3526: vpmulld(vcoef[idx], vcoef[idx], vnext, Assembler::AVX_256bit); >>> 3527: } >>> 3528: jmp(LONG_VECTOR_LOOP_BEGIN); >> >> Calculating backward forces you to do calculating the coefficients on each iteration, I think doing this normally would be better. > > But doing it forward requires a `reduceLane` on each iteration. It's faster to do it backward. 
No, you don't need to, the vector loop can be calculated as:

    IntVector accumulation = IntVector.zero(INT_SPECIES);
    for (int i = 0; i < bound; i += INT_SPECIES.length()) {
        IntVector current = IntVector.load(INT_SPECIES, array, i);
        accumulation = accumulation.mul(31**(INT_SPECIES.length())).add(current);
    }
    return accumulation.mul(IntVector.of(31**(INT_SPECIES.length() - 1), ..., 31**2, 31, 1)).reduce(ADD);

Each iteration only requires a multiplication and an addition. The weight of lanes can be calculated just before the reduction operation. ------------- PR: https://git.openjdk.org/jdk/pull/10847 From kdnilsen at openjdk.org Mon Oct 31 16:11:43 2022 From: kdnilsen at openjdk.org (Kelvin Nilsen) Date: Mon, 31 Oct 2022 16:11:43 GMT Subject: RFR: Fix assertion error with advance promotion budgeting Message-ID: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> Round-off errors were resulting in an assertion error. Budgeting calculations are "complicated" because only regions that are fully empty may be loaned from old-gen to young-gen. This change recalculates certain values during budgeting adjustments that follow collection set selection rather than endeavoring to make changes to the values computed before collection set selection. The API is simpler as a result. ------------- Commit messages: - Fix whitespace - Remove diagnostic messages - Remove minimum_evacuation_reserve arg from budgeting calculations - Remove regions_available_to_loan argument from budgeting - Remove old_regions_loaned_for_young_evac parameter - Fix round-off errrors that were causing an assertion failure - Revise budgeting calculations to address assertion error Changes: https://git.openjdk.org/shenandoah/pull/165/files Webrev: https://webrevs.openjdk.org/?repo=shenandoah&pr=165&range=00 Stats: 277 lines in 2 files changed: 170 ins; 53 del; 54 mod Patch: https://git.openjdk.org/shenandoah/pull/165.diff Fetch: git fetch https://git.openjdk.org/shenandoah pull/165/head:pull/165 PR: https://git.openjdk.org/shenandoah/pull/165 From rkennke at openjdk.org Mon Oct 31 16:47:30 2022 From: rkennke at openjdk.org (Roman Kennke) Date: Mon, 31 Oct 2022 16:47:30 GMT Subject: RFR: Fix assertion error with advance promotion budgeting In-Reply-To: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> References: <6OaYOfQqvIkYAE6dl93o_GprE8uB7dPnchns732AQzg=.1fe8d2de-641c-48c8-bfc1-c20e25aea153@github.com> Message-ID: On Mon, 31 Oct 2022 14:31:49 GMT, Kelvin Nilsen wrote: > Round-off errors were resulting in an assertion error. Budgeting calculations are "complicated" because only regions that are fully empty may be loaned from old-gen to young-gen. This change recalculates certain values during budgeting adjustments that follow collection set selection rather than endeavoring to make changes to the values computed before collection set selection. The API is simpler as a result. Looks ok to me. ------------- Marked as reviewed by rkennke (Lead). PR: https://git.openjdk.org/shenandoah/pull/165 From phh at openjdk.org Mon Oct 31 21:46:13 2022 From: phh at openjdk.org (Paul Hohensee) Date: Mon, 31 Oct 2022 21:46:13 GMT Subject: RFR: Make use of nanoseconds for GC times [v2] In-Reply-To: References: Message-ID: <0CwATc7RlFZIcKlUEt3CMSK1m4cTJM8LUF7zvgsKgiA=.606cc91b-9c0a-443c-9d20-78c4be344bc6@github.com> On Fri, 3 Dec 2021 00:01:56 GMT, David Alvarez wrote: >> In multiple places for hotspot management the resolution used for times is milliseconds.
With new collectors getting into sub-millisecond pause times, this resolution is not enough. >> >> This change moves internal values in LastGcStat to use milliseconds. GcInfo is still reporting the values in milliseconds for compatibility reasons > > David Alvarez has refreshed the contents of this pull request, and previous commits have been removed. The incremental views will show differences compared to the previous content of the PR. The pull request contains one new commit since the last revision: > > Make use of nanoseconds for GC times This PR should be against jdk tip, not shenandoah. In timer.cpp, use "counter_to_seconds(counter) * MILLIUNITS" instead of "counter_to_seconds(counter) * 1000.0". In management.cpp, use NANOUNITS instead of (double)1000000000.0 and MILLIUNITS instead of (double)1000.0. Otherwise, lgtm. ------------- Changes requested by phh (no project role). PR: https://git.openjdk.org/shenandoah/pull/102 From redestad at openjdk.org Mon Oct 31 21:48:37 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 21:48:37 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v4] In-Reply-To: References: Message-ID: > Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. > > I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. > > Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`). It seems somewhat premature to parameterize this up front. > > The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. > > With the most recent fixes the x64 intrinsic results on my workstation look like this: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ? 0.017 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ? 0.049 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ? 0.221 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ? 7.020 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ? 0.013 ns/op > StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ? 0.122 ns/op > StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ? 0.512 ns/op > StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ? 67.630 ns/op > > I.e. no measurable overhead compared to baseline even for `size == 1`. > > The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. > > Benchmark for `Arrays.hashCode`: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 1.884 ? 0.013 ns/op > ArraysHashCode.bytes 10 avgt 5 6.955 ? 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 87.218 ? 0.595 ns/op > ArraysHashCode.bytes 10000 avgt 5 9419.591 ? 38.308 ns/op > ArraysHashCode.chars 1 avgt 5 2.200 ? 0.010 ns/op > ArraysHashCode.chars 10 avgt 5 6.935 ? 
0.034 ns/op > ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op > ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op > ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op > ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op > ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op > ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op > ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op > ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op > ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op > > Baseline: > > Benchmark (size) Mode Cnt Score Error Units > ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op > ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op > ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op > ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op > ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op > ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op > ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op > ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op > ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op > ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op > ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op > ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op > ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op > ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op > ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op > ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op > > > As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized). I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: Change scalar unroll to 2 element stride, minding dependency chain ------------- Changes: - all: https://git.openjdk.org/jdk/pull/10847/files - new: https://git.openjdk.org/jdk/pull/10847/files/7e8a3e9c..a473c200 Webrevs: - full: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=03 - incr: https://webrevs.openjdk.org/?repo=jdk&pr=10847&range=02-03 Stats: 64 lines in 1 file changed: 28 ins; 20 del; 16 mod Patch: https://git.openjdk.org/jdk/pull/10847.diff Fetch: git fetch https://git.openjdk.org/jdk pull/10847/head:pull/10847 PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 22:10:30 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 22:10:30 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v4] In-Reply-To: References: Message-ID: <3ADHhMibv2q23PC2uQp57gFynSbqH6K4s0jCutZuogM=.b62084b3-bfab-4150-9b2a-e06813099ce8@github.com> On Mon, 31 Oct 2022 21:48:37 GMT, Claes Redestad wrote: >> Continuing the work initiated by @luhenry to unroll and then intrinsify polynomial hash loops. >> >> I've rewired the library changes to route via a single `@IntrinsicCandidate` method. To make this work I've harmonized how they are invoked so that there's less special handling and checks in the intrinsic. Mainly do the null-check outside of the intrinsic for `Arrays.hashCode` cases. >> >> Having a centralized entry point means it'll be easier to parameterize the factor and start values which are now hard-coded (always 31, and a start value of either one for `Arrays` or zero for `String`).
It seems somewhat premature to parameterize this up front. >> >> The current implementation is performance neutral on microbenchmarks on all tested platforms (x64, aarch64) when not enabling the intrinsic. We do add a few trivial method calls which increase the call stack depth, so surprises cannot be ruled out on complex workloads. >> >> With the most recent fixes the x64 intrinsic results on my workstation look like this: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.199 ± 0.017 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 6.933 ± 0.049 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 29.935 ± 0.221 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 1596.982 ± 7.020 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> StringHashCode.Algorithm.defaultLatin1 1 avgt 5 2.200 ± 0.013 ns/op >> StringHashCode.Algorithm.defaultLatin1 10 avgt 5 9.424 ± 0.122 ns/op >> StringHashCode.Algorithm.defaultLatin1 100 avgt 5 90.541 ± 0.512 ns/op >> StringHashCode.Algorithm.defaultLatin1 10000 avgt 5 9425.321 ± 67.630 ns/op >> >> I.e. no measurable overhead compared to baseline even for `size == 1`. >> >> The vectorized code now nominally works for all unsigned cases as well as ints, though more testing would be good. >> >> Benchmark for `Arrays.hashCode`: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 1.884 ± 0.013 ns/op >> ArraysHashCode.bytes 10 avgt 5 6.955 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 87.218 ± 0.595 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9419.591 ± 38.308 ns/op >> ArraysHashCode.chars 1 avgt 5 2.200 ± 0.010 ns/op >> ArraysHashCode.chars 10 avgt 5 6.935 ± 0.034 ns/op >> ArraysHashCode.chars 100 avgt 5 30.216 ± 0.134 ns/op >> ArraysHashCode.chars 10000 avgt 5 1601.629 ± 6.418 ns/op >> ArraysHashCode.ints 1 avgt 5 2.200 ± 0.007 ns/op >> ArraysHashCode.ints 10 avgt 5 6.936 ± 0.034 ns/op >> ArraysHashCode.ints 100 avgt 5 29.412 ± 0.268 ns/op >> ArraysHashCode.ints 10000 avgt 5 1610.578 ± 7.785 ns/op >> ArraysHashCode.shorts 1 avgt 5 1.885 ± 0.012 ns/op >> ArraysHashCode.shorts 10 avgt 5 6.961 ± 0.034 ns/op >> ArraysHashCode.shorts 100 avgt 5 87.095 ± 0.417 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.617 ± 50.089 ns/op >> >> Baseline: >> >> Benchmark (size) Mode Cnt Score Error Units >> ArraysHashCode.bytes 1 avgt 5 3.213 ± 0.207 ns/op >> ArraysHashCode.bytes 10 avgt 5 8.483 ± 0.040 ns/op >> ArraysHashCode.bytes 100 avgt 5 90.315 ± 0.655 ns/op >> ArraysHashCode.bytes 10000 avgt 5 9422.094 ± 62.402 ns/op >> ArraysHashCode.chars 1 avgt 5 3.040 ± 0.066 ns/op >> ArraysHashCode.chars 10 avgt 5 8.497 ± 0.074 ns/op >> ArraysHashCode.chars 100 avgt 5 90.074 ± 0.387 ns/op >> ArraysHashCode.chars 10000 avgt 5 9420.474 ± 41.619 ns/op >> ArraysHashCode.ints 1 avgt 5 2.827 ± 0.019 ns/op >> ArraysHashCode.ints 10 avgt 5 7.727 ± 0.043 ns/op >> ArraysHashCode.ints 100 avgt 5 89.405 ± 0.593 ns/op >> ArraysHashCode.ints 10000 avgt 5 9426.539 ± 51.308 ns/op >> ArraysHashCode.shorts 1 avgt 5 3.071 ± 0.062 ns/op >> ArraysHashCode.shorts 10 avgt 5 8.168 ± 0.049 ns/op >> ArraysHashCode.shorts 100 avgt 5 90.399 ± 0.292 ns/op >> ArraysHashCode.shorts 10000 avgt 5 9420.171 ± 44.474 ns/op >> >> >> As we can see the `Arrays` intrinsics are faster for small inputs, and faster on large inputs for `char` and `int` (the ones currently vectorized).
I aim to fix `byte` and `short` cases before integrating, though it might be acceptable to hand that off as follow-up enhancements to not further delay integration of this enhancement. > > Claes Redestad has updated the pull request incrementally with one additional commit since the last revision: > > Change scalar unroll to 2 element stride, minding dependency chain A stride of 2 allows small element cases to perform a bit better, while also performing better than before on longer arrays for the `byte` and `short` cases that don't get any benefit from vectorization: Benchmark (size) Mode Cnt Score Error Units ArraysHashCode.bytes 1 avgt 5 1.414 ± 0.005 ns/op ArraysHashCode.bytes 10 avgt 5 6.908 ± 0.020 ns/op ArraysHashCode.bytes 100 avgt 5 73.666 ± 0.390 ns/op ArraysHashCode.bytes 10000 avgt 5 7846.994 ± 53.628 ns/op ArraysHashCode.chars 1 avgt 5 1.414 ± 0.007 ns/op ArraysHashCode.chars 10 avgt 5 7.229 ± 0.044 ns/op ArraysHashCode.chars 100 avgt 5 30.718 ± 0.229 ns/op ArraysHashCode.chars 10000 avgt 5 1621.463 ± 116.286 ns/op ArraysHashCode.ints 1 avgt 5 1.414 ± 0.008 ns/op ArraysHashCode.ints 10 avgt 5 7.540 ± 0.042 ns/op ArraysHashCode.ints 100 avgt 5 29.429 ± 0.121 ns/op ArraysHashCode.ints 10000 avgt 5 1600.855 ± 9.274 ns/op ArraysHashCode.shorts 1 avgt 5 1.414 ± 0.010 ns/op ArraysHashCode.shorts 10 avgt 5 6.914 ± 0.045 ns/op ArraysHashCode.shorts 100 avgt 5 73.684 ± 0.501 ns/op ArraysHashCode.shorts 10000 avgt 5 7846.829 ± 49.984 ns/op I've also made some changes to improve the String cases, which can avoid the first coeff*h multiplication on the first pass. This gets the size 1 latin1 case down to 1.1ns/op without penalizing the empty case. We're now improving over the baseline on almost all* tested sizes: Benchmark (size) Mode Cnt Score Error Units StringHashCode.Algorithm.defaultLatin1 0 avgt 5 0.946 ± 0.005 ns/op StringHashCode.Algorithm.defaultLatin1 1 avgt 5 1.108 ± 0.003 ns/op StringHashCode.Algorithm.defaultLatin1 2 avgt 5 2.042 ± 0.005 ns/op StringHashCode.Algorithm.defaultLatin1 31 avgt 5 18.636 ± 0.286 ns/op StringHashCode.Algorithm.defaultLatin1 32 avgt 5 15.938 ± 1.086 ns/op StringHashCode.Algorithm.defaultUTF16 0 avgt 5 1.257 ± 0.004 ns/op StringHashCode.Algorithm.defaultUTF16 1 avgt 5 2.198 ± 0.005 ns/op StringHashCode.Algorithm.defaultUTF16 2 avgt 5 2.559 ± 0.011 ns/op StringHashCode.Algorithm.defaultUTF16 31 avgt 5 15.754 ± 0.036 ns/op StringHashCode.Algorithm.defaultUTF16 32 avgt 5 16.616 ± 0.042 ns/op Baseline: Benchmark (size) Mode Cnt Score Error Units StringHashCode.Algorithm.defaultLatin1 0 avgt 5 0.942 ± 0.005 ns/op StringHashCode.Algorithm.defaultLatin1 1 avgt 5 1.991 ± 0.013 ns/op StringHashCode.Algorithm.defaultLatin1 2 avgt 5 2.831 ± 0.021 ns/op StringHashCode.Algorithm.defaultLatin1 31 avgt 5 25.042 ± 0.112 ns/op StringHashCode.Algorithm.defaultLatin1 32 avgt 5 25.857 ± 0.133 ns/op StringHashCode.Algorithm.defaultUTF16 0 avgt 5 0.789 ± 0.006 ns/op StringHashCode.Algorithm.defaultUTF16 1 avgt 5 3.459 ± 0.007 ns/op StringHashCode.Algorithm.defaultUTF16 2 avgt 5 4.400 ± 0.010 ns/op StringHashCode.Algorithm.defaultUTF16 31 avgt 5 25.721 ± 0.067 ns/op StringHashCode.Algorithm.defaultUTF16 32 avgt 5 27.162 ± 0.093 ns/op There's a negligible regression on `defaultUTF16` for size = 0 due to moving the length shift up earlier. This can only happen when running with CompactStrings disabled. And even if you were, the change significantly helps improve sizes 1-31, which should more than make up for a small cost increase when hashing empty strings.
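
To make the "2 element stride, minding dependency chain" change easier to picture, here is a minimal sketch of a hand-unrolled 31-based polynomial hash over an int[]. It is illustrative only: the method name hashCodeStride2 and the plain int[] signature are assumptions, not the code in the PR, which goes through the shared @IntrinsicCandidate entry point and covers several element types.

    // Sketch only, not the actual patch: a 2-element-stride unroll of
    // h = 31 * h + a[i]. The partial term a[i] * 31 + a[i + 1] does not
    // depend on h, so it can execute in parallel with the h * 31 * 31
    // multiply, shortening the per-element dependency chain.
    static int hashCodeStride2(int[] a, int initial) {
        int h = initial;                // 0 for String.hashCode, 1 for Arrays.hashCode
        int i = 0;
        for (; i < a.length - 1; i += 2) {
            h = h * (31 * 31) + (a[i] * 31 + a[i + 1]);
        }
        if (i < a.length) {             // tail: at most one element left
            h = h * 31 + a[i];
        }
        return h;
    }

Unrolling by two this way keeps the result bit-identical to the simple loop while giving out-of-order hardware more independent work per iteration.
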
------------- PR: https://git.openjdk.org/jdk/pull/10847 From redestad at openjdk.org Mon Oct 31 22:10:31 2022 From: redestad at openjdk.org (Claes Redestad) Date: Mon, 31 Oct 2022 22:10:31 GMT Subject: RFR: 8282664: Unroll by hand StringUTF16 and StringLatin1 polynomial hash loops [v4] In-Reply-To: References: <5QCLl4R86LlhX9dkwbK7-NtPwkiN9tgQvj0VFoApvzU=.0b12f837-47d4-470a-9b40-961ccd8e181e@github.com> Message-ID: On Mon, 31 Oct 2022 13:35:36 GMT, Quan Anh Mai wrote: >> But doing it forward requires a `reduceLane` on each iteration. It's faster to do it backward. > > No you don't need to, the vector loop can be calculated as: > > IntVector accumulation = IntVector.zero(INT_SPECIES); > for (int i = 0; i < bound; i += INT_SPECIES.length()) { > IntVector current = IntVector.load(INT_SPECIES, array, i); > accumulation = accumulation.mul(31**(INT_SPECIES.length())).add(current); > } > return accumulation.mul(IntVector.of(31**(INT_SPECIES.length() - 1), ..., 31**2, 31, 1)).reduce(ADD); > > Each iteration only requires a multiplication and an addition. The weight of lanes can be calculated just before the reduction operation. Ok, I can try rewriting as @merykitty suggests and compare. I'm running out of time to spend on this right now, though, so I sort of hope we can do this experiment as a follow-up RFE. ------------- PR: https://git.openjdk.org/jdk/pull/10847
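
For reference, the forward-accumulating loop sketched in the quoted reply could look roughly as follows when written against the incubating jdk.incubator.vector API. This is a sketch under assumptions: the class and method names are invented, it computes the zero-seeded String-style hash over an int[], and it needs --add-modules jdk.incubator.vector to compile; the PR itself implements the loop as a HotSpot intrinsic rather than with the Vector API.

    import jdk.incubator.vector.IntVector;
    import jdk.incubator.vector.VectorOperators;
    import jdk.incubator.vector.VectorSpecies;

    class PolyHashSketch {
        private static final VectorSpecies<Integer> SPECIES = IntVector.SPECIES_PREFERRED;

        // Zero-seeded polynomial hash: a[0]*31^(n-1) + ... + a[n-2]*31 + a[n-1].
        static int hash(int[] a) {
            int lanes = SPECIES.length();
            int blockFactor = pow31(lanes);        // 31^lanes shifts the accumulator by one full block
            int[] w = new int[lanes];              // per-lane weights 31^(lanes-1), ..., 31, 1
            for (int j = 0; j < lanes; j++) {
                w[j] = pow31(lanes - 1 - j);
            }
            IntVector weights = IntVector.fromArray(SPECIES, w, 0);

            IntVector acc = IntVector.zero(SPECIES);
            int i = 0;
            int bound = SPECIES.loopBound(a.length);
            for (; i < bound; i += lanes) {
                IntVector cur = IntVector.fromArray(SPECIES, a, i);
                acc = acc.mul(blockFactor).add(cur);   // one multiply and one add per iteration
            }
            int h = acc.mul(weights).reduceLanes(VectorOperators.ADD);  // weight lanes once, then reduce
            for (; i < a.length; i++) {                // scalar tail for the remaining elements
                h = 31 * h + a[i];
            }
            return h;
        }

        private static int pow31(int e) {          // 31^e with int wrap-around, which is what the hash needs
            int r = 1;
            for (int k = 0; k < e; k++) {
                r *= 31;
            }
            return r;
        }
    }

A nonzero seed such as the 1 used by Arrays.hashCode can be folded in afterwards by adding pow31(a.length) to the result, so the same loop covers both variants.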