RFR: 8319801: Recursive lightweight locking: aarch64 implementation
Andrew Haley
aph at openjdk.org
Fri Nov 17 08:43:32 UTC 2023
On Thu, 16 Nov 2023 11:59:26 GMT, Axel Boldt-Christmas <aboldtch at openjdk.org> wrote:
> > Hmm. Which hardware is this? This is stuff I need to be aware of. Please contact me off-line if it's hard to say in public.
>
> This has been observed with different versions of the Apple M1
> processors.
Heh, go figure. This is even more remarkable, given that Apple's own
compilers seem to always emit LSE CAS, at least by default.
I suppose it's understandable that Apple didn't focus much on LSE
performance, given that M1 and its derivatives aren't intended for
multi-socket designs, and are based on a highly-integrated cellphone
design. But no one told the compiler team.
> To clarify, when I say contention I am referring to java monitor
> contention, that is, multiple threads are trying to lock the same
> object.
>
> The performance is particularly bad if the LSE CAS fails. This
> pattern is something that is prevalent in the un-contended inflated
> recursive lock. In the current implementation this is still an
> issue, but as we are removing most of the common reasons why an
> un-contended lock gets inflated we should not see this as often.
>
> We have at some point also had some code which improves this (e.g.
> https://github.com/xmas92/jdk/blob/3150426b261bfceacdceda1b2ebccd82b6e6fb41/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp#L162-L167
> ) But I did not want to also change the inflated lock / unlock paths
> in this PR.
Good idea.
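
To illustrate the point about failing CASes, here is a minimal C++ sketch of the "look before you CAS" pattern: read the lock word with a plain load first, and only issue the CAS when it can plausibly succeed. This is not the HotSpot code. `ToyLock`, `toy_try_lock`, and `toy_unlock` are hypothetical names, the owner word stands in for the lock-stack check the PR actually uses, and whether the compiler emits an LSE CAS or an LL/SC pair for `compare_exchange_strong` depends on the target flags.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical toy lock: owner == 0 means unlocked, otherwise the owner's id.
struct ToyLock {
    std::atomic<std::uintptr_t> owner{0};
    unsigned recursions = 0;               // only touched by the current owner
    std::atomic<unsigned> cas_attempts{0}; // instrumentation for this example only
};

// Returns true if `self` holds the lock on return.
inline bool toy_try_lock(ToyLock& l, std::uintptr_t self) {
    // Plain load first: cheap, and it tells us whether a CAS could succeed.
    std::uintptr_t cur = l.owner.load(std::memory_order_relaxed);
    if (cur == self) {
        ++l.recursions;  // recursive enter: no CAS at all
        return true;
    }
    if (cur != 0) {
        return false;    // held by another thread: a CAS would fail, so skip it
    }
    l.cas_attempts.fetch_add(1, std::memory_order_relaxed);
    std::uintptr_t expected = 0;
    return l.owner.compare_exchange_strong(expected, self,
                                           std::memory_order_acquire,
                                           std::memory_order_relaxed);
}

inline void toy_unlock(ToyLock& l, std::uintptr_t self) {
    assert(l.owner.load(std::memory_order_relaxed) == self);
    if (l.recursions > 0) {
        --l.recursions;  // recursive exit: no atomic read-modify-write needed
        return;
    }
    l.owner.store(0, std::memory_order_release);
}
```

In this sketch a recursive enter, or an attempt on a lock held by another id, never reaches the CAS, so the only CAS failures left are genuine races between the load and the CAS.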
> We have also tried different recursive lightweight unlock paths,
> some where avoiding the LSE CAS has been more important. In the
> current PR it is less important as we make decisions based on the
> state of the lock stack first. This avoids most of the cases of
> un-contended failing CASes that occur in the main line
> implementation. However it still seemed to be more performant on
> this hardware to use an LL-SC pair.
>
> Here
Where? I see a bunch of patches, but not results.
> are some microbenchmarks running on an Apple M1 Pro chip. This
> is an extended version of the LockUnlock.java JMH micros. (Patch
> [3a7eb13](https://github.com/openjdk/jdk/commit/3a7eb137140971f6b21ffea5dbf512300b38371a))
> Extended because some of the tests never get compiled because C2
> bails out. (Clearly identified in the results as they are an order
> of magnitude worse).
It's really important not to complexify the AArch64 port for the sake
of one manufacturer, no matter how important. If this stuff can be
done by providing a hint to the CAS macros that ldx/stx is to be
preferred in the Apple case, then it would be tolerable. My primary
goal is protecting the AArch64 back end from well-intentioned
maintainers. And I suspect that this uncontended case isn't that
important to real-world workloads, but I suppose it may be in some
cases.
(And, for avoidance of doubt, I suspect that the Neoverse designs are
far more important for Java. But I have no evidence to prove that.)
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16608#issuecomment-1815946091