RFR: 8291555: Replace stack-locking with fast-locking
Roman Kennke
rkennke at openjdk.org
Thu Oct 6 07:47:06 UTC 2022
On Mon, 8 Aug 2022 12:14:38 GMT, David Holmes <dholmes at openjdk.org> wrote:
> The bar for acceptance for a brand new locking scheme with no fallback is extremely high and needs a lot of bake time and broad performance measurements, to watch for pathologies. That bar is lower if the scheme can be reverted to the old code if needed; and even lower still if the scheme is opt-in in the first place. For Java Object Monitors I made the new mechanism opt-in so the same could be done here. Granted it is not a trivial effort to do that, but I think a phased approach to transition to the new scheme is essential. It could be implemented as an experimental feature initially.
Reverting a change should not be difficult. (Unless maybe another major change arrived in the meantime, which makes reverse-applying a patch non-trivial.)
I'm skeptical about implementing an opt-in runtime switch, though.
- Keeping the old paths side-by-side with the new paths is an engineering effort in itself, as you point out. It means that it, too, introduces a significant risk of breaking locking, one way or the other (or both).
- Making the new path opt-in means that we achieve almost nothing by it: testing would still normally run the old paths (hopefully we didn't break those by making the change), and only use the new paths when explicitly told to, and I don't expect many people to do that voluntarily. It *may* be more useful to make it opt-out, as a quick fix if anybody experiences trouble with it.
- Do we need a runtime-switchable opt-in or opt-out flag for the initial testing and baking? I wouldn't think so: it seems better and cleaner to take the Git branch of this PR and put it through all relevant testing before the change goes in.
- For how long do you think the runtime switch should stay? Because if it's anything but temporary, it means we had better test both paths thoroughly and in an automated fashion. And it may also mean extra maintenance work (with extra avenues for bugs, see above).
> I am not aware, please refresh my memory if you know different, of any core hotspot subsystem just being replaced in one fell swoop in one single release. Yes this needs a lot of testing but customers are not beta-testers. If this goes into a release on by default then there must be a way for customers to turn it off. UseHeavyMonitors is not a fallback as it is not for production use itself. So the new code has to co-exist along-side the old code as we make a transition across 2-3 releases. And yes that means a double-up on some testing as we already do for many things.
I believe the least risky path overall is to make UseHeavyMonitors a production flag. Then it can act as a kill-switch for the new locking code, should anything go bad. I even considered removing stack-locking altogether, and could only measure a minor performance impact, and then only in code that uses obsolete synchronized Java collections like Vector, Stack and StringBuffer.
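For concreteness, the kill-switch would be used like this (a sketch: the jar name is a placeholder, and it assumes UseHeavyMonitors has been promoted to a product flag as proposed above):

```shell
# Assumes UseHeavyMonitors is a product flag (as proposed above).
# Disables the fast-locking path; every synchronized operation uses
# a full (inflated) ObjectMonitor instead.
java -XX:+UseHeavyMonitors -jar app.jar   # app.jar is a placeholder
```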
If you'd argue that it's too risky to use UseHeavyMonitors for that, then surely the risk of introducing a new flag and maintaining two stack-locking subsystems would be even higher.
In a lot of the code, keeping both paths is risky in itself. For example, I needed to change register allocation in the C2 .ad declarations and also in the interpreter/generated assembly code. It's hard enough to verify that this is correct for one implementation, and much harder to implement and verify it correctly for two.
> Any fast locking scheme benefits the uncontended sync case. So if you have a lot of contention and therefore a lot of inflation, the fast locking won't show any benefit.
Not only that. As far as I can tell, 'heavy monitors' would only be worse off in workloads that 1. use uncontended sync and 2. churn monitors. Lots of uncontended sync on the same monitor object is not actually worse than fast-locking (it boils down to a single CAS in both cases). It only gets bad when code keeps allocating short-lived objects, syncs on them once or a few times, and then moves on to the next new sync objects.
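The two patterns above can be sketched as follows (a hypothetical illustration, not code from the PR; with heavy monitors the first loop inflates a fresh monitor per iteration, while the second reuses one):

```java
public class MonitorChurn {
    // Pathological pattern for heavy monitors: each iteration syncs once
    // on a brand-new object, so a fresh monitor is inflated and discarded.
    static long churn(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            Object lock = new Object();   // short-lived sync object
            synchronized (lock) {         // single uncontended sync, then dropped
                sum += i;
            }
        }
        return sum;
    }

    // Benign pattern: repeated uncontended sync on one long-lived object,
    // which boils down to a single CAS per acquisition either way.
    static long reuse(int iterations) {
        long sum = 0;
        Object lock = new Object();
        for (int i = 0; i < iterations; i++) {
            synchronized (lock) {
                sum += i;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Both compute the same result; only the locking behavior differs.
        System.out.println(churn(1000) == reuse(1000));  // prints "true"
    }
}
```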
> What "modern workloads" are you using to measure this?
So far I tested with SPECjbb and SPECjvm-workloads-transplanted-into-JMH, DaCapo and Renaissance. I could only measure regressions with heavy monitors in workloads that use XML/XSLT, which turned out to be because the XSLT compiler generates code that uses StringBuffer for (single-threaded) parsing. I also found a few other places in XML where usage of Stack and Vector has some impact. I can provide fixes for those, if needed (but I'm not sure whether this should go into the JDK, upstream Xalan/Xerces, or both).
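The StringBuffer issue is easy to see in isolation (a sketch, not the XSLTC-generated code itself): every StringBuffer method is synchronized, so single-threaded parsing pays for an uncontended lock on each append, whereas StringBuilder is the unsynchronized drop-in replacement.

```java
public class BufferSync {
    // Builds the same string with both classes; with heavy monitors the
    // StringBuffer variant performs a monitor operation per append() call.
    static boolean sameResult(int n) {
        StringBuffer sbuf = new StringBuffer();   // append() is synchronized
        StringBuilder sbld = new StringBuilder(); // no synchronization
        for (int i = 0; i < n; i++) {
            sbuf.append(i);
            sbld.append(i);
        }
        return sbuf.toString().equals(sbld.toString());
    }

    public static void main(String[] args) {
        System.out.println(sameResult(3));  // prints "true"
    }
}
```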
> We eventually got rid of biased-locking because it no longer showed any benefit, so it is possible that fast locking (of whichever form) could go the same way. And we may have moved past heavy use of synchronized in general for that matter, especially as Loom instigated many changes over to java.util.concurrent locks.
Yup.
> Is UseHeavyMonitors in good enough shape to reliably be used for benchmark comparisons?
Yes, except that the flag would have to be made a product flag. Also, it is useful to benchmark against this PR rather than upstream JDK, because it simplifies the inflation protocol much as it would be simplified without any stack-locking. I can make a standalone PR that gets rid of stack-locking altogether, if that is useful.
Also keep in mind that both this fast-locking PR and total removal of stack-locking would enable some follow-up improvements: we'd no longer have to inflate monitors in order to install or read an i-hashcode. And GC code may similarly benefit from easier read/write access to object age bits. This might benefit generational concurrent GC efforts.
-------------
PR: https://git.openjdk.org/jdk/pull/9680
More information about the shenandoah-dev
mailing list