RFR: 8291555: Replace stack-locking with fast-locking
David Holmes
dholmes at openjdk.org
Thu Oct 6 07:47:05 UTC 2022
On Thu, 28 Jul 2022 19:58:34 GMT, Roman Kennke <rkennke at openjdk.org> wrote:
> This change replaces the current stack-locking implementation with a fast-locking scheme that retains the advantages of stack-locking (namely fast locking in uncontended code-paths), while avoiding the overloading of the mark word. That overloading causes massive problems with Lilliput, because it means we have to check for and deal with this situation. And because of its very racy nature, this turns out to be very complex and involves a variant of the inflation protocol to ensure that the object header is stable.
>
> What the original stack-locking does is basically push a stack-lock onto the stack, which consists only of the displaced header, and CAS a pointer to this stack location into the object header (the lowest two header bits being 00 indicates 'stack-locked'). The pointer into the stack can then be used to identify which thread currently owns the lock.
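>
> As a rough illustration (a simplified, self-contained model with invented names, not the actual HotSpot code), the legacy stack-locking fast path works along these lines:
>
> ```cpp
> // Simplified model of legacy stack-locking: the displaced header is saved in a
> // BasicLock in the caller's frame, and a pointer to that stack slot is CASed
> // into the object header. Names and layout are illustrative only.
> #include <atomic>
> #include <cstdint>
>
> struct ObjectHeader {                       // stand-in for the object's mark word
>   std::atomic<uintptr_t> mark;
> };
>
> struct BasicLock {                          // lives in the owning frame on the stack
>   uintptr_t displaced_header;               // copy of the original mark word
> };
>
> // Lowest two mark bits: 01 = unlocked (neutral), 00 = stack-locked.
> static bool stack_lock(ObjectHeader* obj, BasicLock* lock_on_stack) {
>   uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
>   if ((mark & 0x3) != 0x1) return false;    // not neutral: take the slow path
>   lock_on_stack->displaced_header = mark;   // save the displaced header on the stack
>   uintptr_t locked = reinterpret_cast<uintptr_t>(lock_on_stack); // aligned, so low bits are 00
>   // CAS the stack-slot pointer into the header; on success we own the lock, and
>   // anyone reading the header can tell which thread's stack owns it.
>   return obj->mark.compare_exchange_strong(mark, locked);
> }
> ```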
>
> This change basically reverses stack-locking: It still CASes the lowest two header bits to 00 to indicate 'fast-locked' but does *not* overload the upper bits with a stack-pointer. Instead, it pushes the object-reference to a thread-local lock-stack. This is a new structure which is basically a small array of oops that is associated with each thread. Experience shows that this array typically remains very small (3-5 elements). Using this lock stack, it is possible to query which threads own which locks. Most importantly, the most common question 'does the current thread own me?' is answered very quickly with a scan of the array. More complex queries like 'which thread owns X?' are not performed in very performance-critical paths (usually in code like JVMTI or deadlock detection) where it is ok to do more complex operations. The lock-stack is also a new set of GC roots, and would be scanned during thread scanning, possibly concurrently, via the normal protocols.
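>
> A minimal sketch of the fast-locking path and the per-thread lock-stack (again a self-contained model with invented names, not the code in this PR):
>
> ```cpp
> // Simplified model of fast-locking: CAS only the low two mark bits to 00 and
> // record ownership by pushing the object onto a small thread-local array.
> #include <atomic>
> #include <cstddef>
> #include <cstdint>
>
> struct ObjectHeader { std::atomic<uintptr_t> mark; };
>
> struct LockStack {                          // small per-thread array of owned oops
>   ObjectHeader* elems[8];
>   size_t top = 0;
>
>   void push(ObjectHeader* o) { elems[top++] = o; }
>
>   void remove(ObjectHeader* o) {            // called on unlock
>     for (size_t i = 0; i < top; i++) {
>       if (elems[i] == o) {
>         for (size_t j = i; j + 1 < top; j++) elems[j] = elems[j + 1];
>         top--;
>         return;
>       }
>     }
>   }
>
>   // "Does the current thread own this object?" is a short linear scan.
>   bool contains(ObjectHeader* o) const {
>     for (size_t i = 0; i < top; i++) if (elems[i] == o) return true;
>     return false;
>   }
> };
>
> static thread_local LockStack lock_stack;
>
> // Lowest two mark bits: 01 = unlocked, 00 = fast-locked. The upper bits stay
> // untouched, so the rest of the header remains available (e.g. for Lilliput).
> static bool fast_lock(ObjectHeader* obj) {
>   uintptr_t unlocked = obj->mark.load(std::memory_order_relaxed) | 0x1; // expected value
>   uintptr_t locked   = unlocked & ~uintptr_t(0x3);                      // low two bits 00
>   if (obj->mark.compare_exchange_strong(unlocked, locked)) {
>     lock_stack.push(obj);                   // record ownership thread-locally
>     return true;
>   }
>   return false;                             // already locked or contended: inflate instead
> }
>
> static void fast_unlock(ObjectHeader* obj) {
>   lock_stack.remove(obj);
>   uintptr_t mark = obj->mark.load(std::memory_order_relaxed);
>   obj->mark.store(mark | 0x1);              // restore 'unlocked' (races with inflation ignored here)
> }
> ```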
>
> In contrast to stack-locking, fast-locking does *not* support recursive locking (yet). When recursive locking happens, the fast-lock gets inflated to a full monitor. It is not clear whether it is worth adding support for recursive fast-locking.
>
> One trouble is that when a contending thread arrives at a fast-locked object, it must inflate the fast-lock to a full monitor. Normally, we need to know the current owning thread, and record that in the monitor, so that the contending thread can wait for the current owner to properly exit the monitor. However, fast-locking doesn't have this information. What we do instead is record a special marker, ANONYMOUS_OWNER. When the thread that currently holds the lock arrives at monitorexit and observes ANONYMOUS_OWNER, it knows it must be itself, fixes the owner to be itself, and then properly exits the monitor, thus handing over to the contending thread.
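>
> A hedged sketch of the ANONYMOUS_OWNER hand-over (the sentinel value and function names are invented here; the real code differs):
>
> ```cpp
> // Simplified model: a contender inflates a fast-locked object without knowing
> // the owner; the real owner fixes up the monitor when it reaches monitorexit.
> #include <atomic>
> #include <cstdint>
>
> struct Thread;
>
> struct ObjectMonitor {
>   static Thread* const ANONYMOUS_OWNER;     // sentinel: "fast-locked by some unknown thread"
>   std::atomic<Thread*> owner{nullptr};
> };
>
> // Any non-null, never-dereferenced value works as the sentinel in this sketch.
> Thread* const ObjectMonitor::ANONYMOUS_OWNER = reinterpret_cast<Thread*>(uintptr_t(1));
>
> // Contending thread: the object is fast-locked, but we don't know by whom, so
> // inflate to a monitor owned by the anonymous sentinel and block on it as usual.
> void inflate_fast_locked(ObjectMonitor* mon) {
>   mon->owner.store(ObjectMonitor::ANONYMOUS_OWNER);
>   // ... install the monitor in the object header, then enter/park as usual ...
> }
>
> // Owning thread at monitorexit: if the owner is ANONYMOUS_OWNER it can only be
> // us (we hold the fast-lock), so claim the monitor and then exit it properly.
> void monitor_exit(ObjectMonitor* mon, Thread* self) {
>   if (mon->owner.load() == ObjectMonitor::ANONYMOUS_OWNER) {
>     mon->owner.store(self);                 // fix up ownership before the real exit
>   }
>   // ... regular monitor exit: release and hand over to the contending thread ...
> }
> ```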
>
> As an alternative, I considered removing stack-locking altogether and only using heavy monitors. In most workloads this did not show measurable regressions. However, in a few workloads, I have observed severe regressions. All of them have been using old synchronized Java collections (Vector, Stack), StringBuffer or similar code. The combination of two conditions leads to regressions without stack- or fast-locking: 1. The workload synchronizes on uncontended locks (e.g. single-threaded use of Vector or StringBuffer) and 2. The workload churns such locks. IOW, uncontended use of Vector, StringBuffer, etc. as such is ok, but creating lots of such single-use, single-threaded-locked objects leads to massive ObjectMonitor churn, which can have a significant performance impact. But alas, such code exists, and we probably don't want to punish it if we can avoid it.
>
> This change makes it possible to simplify (and speed up!) a lot of code:
>
> - The inflation protocol is no longer necessary: we can directly CAS the (tagged) ObjectMonitor pointer to the object header.
> - Accessing the hashcode could now always be done in the fast path, if the hashcode has been installed. Fast-locked headers can be used directly; for monitor-locked objects we can easily reach through to the displaced header. This is safe because Java threads participate in the monitor deflation protocol. This would be implemented in a separate PR.
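>
> A rough sketch of the two simplifications above (all field and function names are invented for illustration; this is not the actual HotSpot code):
>
> ```cpp
> #include <atomic>
> #include <cstdint>
>
> struct ObjectMonitor { uintptr_t displaced_header; };   // saved original mark word
> struct Obj { std::atomic<uintptr_t> mark; };
>
> // 1. Without stack pointers in the header, inflation becomes a single CAS of
> //    the tagged monitor pointer (low bits 10) over the observed mark.
> bool install_monitor(Obj* o, ObjectMonitor* mon, uintptr_t observed_mark) {
>   uintptr_t tagged = reinterpret_cast<uintptr_t>(mon) | 0x2;
>   return o->mark.compare_exchange_strong(observed_mark, tagged);
> }
>
> // 2. Hashcode fast path: a fast-locked header still carries the hash bits, and
> //    for a monitor-locked object we reach through to the displaced header.
> uintptr_t hash_bits(Obj* o) {
>   uintptr_t m = o->mark.load();
>   if ((m & 0x3) == 0x2) {                                // monitor-locked
>     ObjectMonitor* mon = reinterpret_cast<ObjectMonitor*>(m & ~uintptr_t(0x3));
>     m = mon->displaced_header;
>   }
>   return (m >> 8) & 0x7FFFFFFF;             // bit layout of the hash is illustrative only
> }
> ```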
>
> ### Benchmarks
>
> All benchmarks are run on server-class metal machines. The JVM settings are always: `-Xmx20g -Xms20g -XX:+UseParallelGC`. All benchmarks are ms/ops, less is better.
>
> #### DaCapo/AArch64
>
> Those measurements have been taken on a Graviton2 box with 64 CPU cores (an AWS m6g.metal instance). It uses the DaCapo evaluation version, git hash 309e1fa (download file dacapo-evaluation-git+309e1fa.jar). I needed to exclude the cassandra, h2o & kafka benchmarks because of incompatibility with JDK20. Benchmarks that showed results far off the baseline or showed high variance have been repeated and I am reporting results with the most bias *against* fast-locking. The sunflow benchmark is really far off the mark - the baseline run with stack-locking exhibited very high run-to-run variance and generally much worse performance, while with fast-locking the variance was very low and the results very stable between runs. I wouldn't trust that benchmark - I mean, what is it actually doing such that a change in locking shows a >30% perf difference?
>
> benchmark | baseline | fast-locking | % | size
> -- | -- | -- | -- | --
> avrora | 27859 | 27563 | 1.07% | large
> batik | 20786 | 20847 | -0.29% | large
> biojava | 27421 | 27334 | 0.32% | default
> eclipse | 59918 | 60522 | -1.00% | large
> fop | 3670 | 3678 | -0.22% | default
> graphchi | 2088 | 2060 | 1.36% | default
> h2 | 297391 | 291292 | 2.09% | huge
> jme | 8762 | 8877 | -1.30% | default
> jython | 18938 | 18878 | 0.32% | default
> luindex | 1339 | 1325 | 1.06% | default
> lusearch | 918 | 936 | -1.92% | default
> pmd | 58291 | 58423 | -0.23% | large
> sunflow | 32617 | 24961 | 30.67% | large
> tomcat | 25481 | 25992 | -1.97% | large
> tradebeans | 314640 | 311706 | 0.94% | huge
> tradesoap | 107473 | 110246 | -2.52% | huge
> xalan | 6047 | 5882 | 2.81% | default
> zxing | 970 | 926 | 4.75% | default
>
> #### DaCapo/x86_64
>
> The following measurements have been taken on an Intel Xeon Scalable Processor (Cascade Lake 8252C) (an AWS m5zn.metal instance). All the same settings and considerations as in the measurements above.
>
> benchmark | baseline | fast-locking | % | size
> -- | -- | -- | -- | --
> avrora | 127690 | 126749 | 0.74% | large
> batik | 12736 | 12641 | 0.75% | large
> biojava | 15423 | 15404 | 0.12% | default
> eclipse | 41174 | 41498 | -0.78% | large
> fop | 2184 | 2172 | 0.55% | default
> graphchi | 1579 | 1560 | 1.22% | default
> h2 | 227614 | 230040 | -1.05% | huge
> jme | 8591 | 8398 | 2.30% | default
> jython | 13473 | 13356 | 0.88% | default
> luindex | 824 | 813 | 1.35% | default
> lusearch | 962 | 968 | -0.62% | default
> pmd | 40827 | 39654 | 2.96% | large
> sunflow | 53362 | 43475 | 22.74% | large
> tomcat | 27549 | 28029 | -1.71% | large
> tradebeans | 190757 | 190994 | -0.12% | huge
> tradesoap | 68099 | 67934 | 0.24% | huge
> xalan | 7969 | 8178 | -2.56% | default
> zxing | 1176 | 1148 | 2.44% | default
>
> #### Renaissance/AArch64
>
> This tests Renaissance/JMH version 0.14.1 on the same machines as DaCapo above, with the same JVM settings.
>
> benchmark | baseline | fast-locking | %
> -- | -- | -- | --
> AkkaUct | 2558.832 | 2513.594 | 1.80%
> Reactors | 14715.626 | 14311.246 | 2.83%
> Als | 1851.485 | 1869.622 | -0.97%
> ChiSquare | 1007.788 | 1003.165 | 0.46%
> GaussMix | 1157.491 | 1149.969 | 0.65%
> LogRegression | 717.772 | 733.576 | -2.15%
> MovieLens | 7916.181 | 8002.226 | -1.08%
> NaiveBayes | 395.296 | 386.611 | 2.25%
> PageRank | 4294.939 | 4346.333 | -1.18%
> FjKmeans | 519.2 | 498.357 | 4.18%
> FutureGenetic | 2578.504 | 2589.255 | -0.42%
> Mnemonics | 4898.886 | 4903.689 | -0.10%
> ParMnemonics | 4260.507 | 4210.121 | 1.20%
> Scrabble | 139.37 | 138.312 | 0.76%
> RxScrabble | 320.114 | 322.651 | -0.79%
> Dotty | 1056.543 | 1068.492 | -1.12%
> ScalaDoku | 3443.117 | 3449.477 | -0.18%
> ScalaStmBench7 | 1102.43 | 1115.142 | -1.14%
> FinagleChirper | 6814.192 | 6853.38 | -0.57%
> FinagleHttp | 4762.902 | 4807.564 | -0.93%
>
> #### Renaissance/x86_64
>
> benchmark | baseline | fast-locking | %
> -- | -- | -- | --
> AkkaUct | 1117.185 | 1116.425 | 0.07%
> Reactors | 11561.354 | 11812.499 | -2.13%
> Als | 1580.838 | 1575.318 | 0.35%
> ChiSquare | 459.601 | 467.109 | -1.61%
> GaussMix | 705.944 | 685.595 | 2.97%
> LogRegression | 659.944 | 656.428 | 0.54%
> MovieLens | 7434.303 | 7592.271 | -2.08%
> NaiveBayes | 413.482 | 417.369 | -0.93%
> PageRank | 3259.233 | 3276.589 | -0.53%
> FjKmeans | 946.429 | 938.991 | 0.79%
> FutureGenetic | 1760.672 | 1815.272 | -3.01%
> Scrabble | 147.996 | 150.084 | -1.39%
> RxScrabble | 177.755 | 177.956 | -0.11%
> Dotty | 673.754 | 683.919 | -1.49%
> ScalaKmeans | 165.376 | 168.925 | -2.10%
> ScalaStmBench7 | 1080.187 | 1049.184 | 2.95%
>
> Some renaissance benchmarks are missing: DecTree, DbShootout and Neo4jAnalytics are not compatible with JDK20. The remaining benchmarks show very high run-to-run variance, which I am investigating (and will probably address by running them much more often).
>
> Please let me know if you want me to run any other workloads, or, even better, run them yourself and report here.
>
> ### Testing
> - [x] tier1 (x86_64, aarch64, x86_32)
> - [x] tier2 (x86_64, aarch64)
> - [x] tier3 (x86_64, aarch64)
> - [x] tier4 (x86_64, aarch64)
The bar for acceptance of a brand new locking scheme with no fallback is extremely high; it needs a lot of bake time and broad performance measurements to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed, and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in, so the same could be done here. Granted, it is not a trivial effort to do that, but I think a phased approach to transitioning to the new scheme is essential. It could be implemented as an experimental feature initially.
I am not aware (please refresh my memory if you know differently) of any core hotspot subsystem just being replaced in one fell swoop in a single release. Yes, this needs a lot of testing, but customers are not beta-testers. If this goes into a release on by default, then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback, as it is not for production use itself. So the new code has to co-exist alongside the old code as we make a transition across 2-3 releases. And yes, that means doubling up on some testing, as we already do for many things.
Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit. What "modern workloads" are you using to measure this? We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks.
Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons?
I don't have github notification enabled so I missed this discussion.
The JVMS permits lock A, lock B, unlock A, unlock B in bytecode - i.e. it passes verification and it does not violate the structured locking rules. It probably also passes verification if there are no exception table entries such that the unlocks are guaranteed to happen - regardless of the order. IIUC from the above, the VM will actually unlock all monitors for which there is a lock-record in the activation when the activation returns. The order in which it does that may be different from how the program would have done it, but I don't see how that makes any difference to anything.
-------------
PR: https://git.openjdk.org/jdk/pull/9680