RFR: 8319801: Recursive lightweight locking: aarch64 implementation
Axel Boldt-Christmas
aboldtch at openjdk.org
Thu Nov 16 12:02:31 UTC 2023
On Wed, 15 Nov 2023 09:57:40 GMT, Andrew Haley <aph at openjdk.org> wrote:
> Hmm. Which hardware is this? This is stuff I need to be aware of. Please contact me off-line if it's hard to say in public.
This has been observed with different versions of the Apple M1 processors.
To clarify, when I say contention I am referring to Java monitor contention, that is, multiple threads trying to lock the same object.
The performance is particularly bad if the LSE CAS fails. This pattern is prevalent in the un-contended inflated recursive lock. In the current implementation this is still an issue, but as we are removing most of the common reasons why an un-contended lock gets inflated, we should not see this as often.
At some point we also had code which improves this (e.g. https://github.com/xmas92/jdk/blob/3150426b261bfceacdceda1b2ebccd82b6e6fb41/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp#L162-L167 ), but I did not want to also change the inflated lock / unlock paths in this PR.
We have also tried different recursive lightweight unlock paths, some where avoiding the LSE CAS was more important. In the current PR it is less important, as we make decisions based on the state of the lock stack first. This avoids most of the cases of un-contended failing CASes that occur in the mainline implementation.
However, it still seemed to be more performant on this hardware to use an LL-SC pair.
Here are some microbenchmark results from an Apple M1 Pro chip, using an extended version of the LockUnlock.java JMH micros (patch 3a7eb137140971f6b21ffea5dbf512300b38371a). It is extended because some of the tests never get compiled, as C2 bails out. (These are clearly identifiable in the results, as they are an order of magnitude worse.)
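For readers less familiar with the pattern, an un-contended recursive lock at the Java level looks roughly like the sketch below: a single thread re-enters a monitor it already holds. The class and method names are made up for illustration, and whether the monitor ends up inflated depends on the locking mode and the VM, not on anything visible in this code.

```java
final class RecursiveLockSketch {
    private final Object lock = new Object();
    private int counter;

    // One thread takes the same monitor twice per iteration:
    // an outer enter followed by a recursive enter. No other thread
    // ever touches 'lock', so the monitor is never contended.
    int lockRecursively(int innerCount) {
        for (int i = 0; i < innerCount; i++) {
            synchronized (lock) {
                synchronized (lock) { // recursive enter
                    counter++;
                }
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(new RecursiveLockSketch().lockRecursively(100));
    }
}
```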
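The idea of consulting the lock stack before touching the mark word can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the HotSpot code: `LockStackModel`, `Obj`, the two-state `mark` field and the `casAttempts` counter are all hypothetical, and the real implementation works on oops and mark words in generated assembly.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class LockStackModel {
    static final int UNLOCKED = 0;
    static final int LOCKED = 1;

    // Hypothetical stand-in for an object header with only two states.
    static final class Obj {
        final AtomicInteger mark = new AtomicInteger(UNLOCKED);
    }

    private final Obj[] stack = new Obj[8]; // per-thread lock stack (toy size)
    private int top;
    int casAttempts; // how often the model had to fall back to a CAS

    boolean enter(Obj o) {
        if (top == stack.length) {
            return false; // full: a real VM would take a slow path here
        }
        if (top > 0 && stack[top - 1] == o) {
            stack[top++] = o; // recursive enter: decided from the lock stack, no CAS
            return true;
        }
        casAttempts++;
        if (!o.mark.compareAndSet(UNLOCKED, LOCKED)) {
            return false; // held elsewhere: slow path in a real VM
        }
        stack[top++] = o;
        return true;
    }

    void exit(Obj o) {
        if (top == 0 || stack[top - 1] != o) {
            throw new IllegalMonitorStateException("unbalanced exit in toy model");
        }
        stack[--top] = null;
        if (top > 0 && stack[top - 1] == o) {
            return; // recursive exit: still held, no CAS
        }
        casAttempts++;
        o.mark.compareAndSet(LOCKED, UNLOCKED);
    }

    public static void main(String[] args) {
        LockStackModel thread = new LockStackModel();
        Obj o = new Obj();
        for (int i = 0; i < 3; i++) thread.enter(o);
        for (int i = 0; i < 3; i++) thread.exit(o);
        System.out.println(thread.casAttempts); // one CAS to lock, one to unlock
    }
}
```

In this model three nested enter/exit pairs cost two CAS attempts, both of which succeed; the recursive steps are resolved purely from the state of the lock stack.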
<details>
<summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Legacy -UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 77,003 ± 7,558 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 1280,276 ± 11,565 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 16525,732 ± 222,518 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 602,364 ± 18,365 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8984,140 ± 389,655 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 1804,546 ± 14,954 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 3504,367 ± 48,076 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 40,477 ± 11,088 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 2275,810 ± 222,888 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 1135,063 ± 9,118 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 1130,178 ± 58,801 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 1134,359 ± 8,701 ns/op
</pre>
</details>
<details>
<summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Legacy +UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 52,511 ± 24,029 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 2473,421 ± 117,243 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 22371,364 ± 1364,761 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 1106,888 ± 26,179 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 12081,724 ± 793,498 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 3265,306 ± 214,527 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 3586,900 ± 165,551 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 88,162 ± 3,763 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 1891,455 ± 67,336 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 943,267 ± 39,638 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 958,670 ± 24,282 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 930,944 ± 13,019 ns/op
</pre>
</details>
<details>
<summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Lightweight -UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 51,767 ± 1,708 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 1320,017 ± 12,844 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 15297,789 ± 538,970 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 599,823 ± 13,903 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8181,012 ± 266,438 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 1285,344 ± 9,739 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 15249,363 ± 59,621 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 33,060 ± 0,260 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 2550,867 ± 32,597 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 1274,052 ± 6,240 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 1286,234 ± 65,275 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 1278,423 ± 11,065 ns/op
</pre>
</details>
<details>
<summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Lightweight +UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 93,536 ± 2,062 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 2993,243 ± 49,181 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 16840,772 ± 86,835 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 1949,685 ± 10,739 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8992,361 ± 42,743 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 3129,174 ± 77,245 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 16841,642 ± 237,059 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 107,438 ± 5,077 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 2657,087 ± 65,913 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 1328,323 ± 85,543 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 1310,857 ± 17,261 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 1311,644 ± 39,859 ns/op
</pre>
</details>
<details>
<summary>Recursive Lightweight(1e7a586c027b6c84f42f317381e6b35ebb45cea0) -UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 66,658 ± 4,420 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 1288,176 ± 14,966 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 15743,745 ± 293,414 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 611,030 ± 8,646 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8273,894 ± 54,006 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 885,686 ± 5,822 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 3678,847 ± 6,472 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 38,393 ± 9,834 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 1653,768 ± 10,920 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 829,223 ± 2,152 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 830,576 ± 24,810 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 835,194 ± 66,321 ns/op
</pre>
</details>
<details>
<summary>Recursive Lightweight(1e7a586c027b6c84f42f317381e6b35ebb45cea0) +UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 85,688 ± 17,538 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 2334,429 ± 70,698 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 15601,593 ± 480,278 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 1065,708 ± 10,372 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8239,642 ± 98,829 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 885,525 ± 1,831 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 3647,819 ± 120,980 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 89,187 ± 0,787 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 1661,228 ± 24,971 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 837,762 ± 43,297 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 829,542 ± 11,918 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 828,762 ± 3,844 ns/op
</pre>
</details>
<details>
<summary>Recursive Lightweight (+ Patch switch to CAS over LL-SC 8dbe0762b98c1427d1588795d77ea73e306d045d) -UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 94,994 ± 19,096 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 1258,710 ± 4,664 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 15381,962 ± 84,907 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 597,632 ± 1,807 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8212,172 ± 125,500 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 933,620 ± 45,059 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 3631,726 ± 23,656 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 36,777 ± 0,349 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 1764,221 ± 6,173 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 889,761 ± 1,720 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 895,285 ± 9,457 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 889,444 ± 5,734 ns/op
</pre>
</details>
<details>
<summary>Recursive Lightweight (+ Patch switch to CAS over LL-SC 8dbe0762b98c1427d1588795d77ea73e306d045d) +UseLSE</summary>
<pre>
Benchmark (innerCount) Mode Cnt Score Error Units
LockUnlock.testContendedLock 100 avgt 4 74,835 ± 9,992 ns/op
LockUnlock.testMonitorRecursiveLockUnlock 100 avgt 4 2299,803 ± 6,954 ns/op
LockUnlock.testMonitorRecursiveLockUnlockLocal 100 avgt 4 15452,039 ± 776,829 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlock 100 avgt 4 1067,769 ± 6,606 ns/op
LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal 100 avgt 4 8219,391 ± 46,559 ns/op
LockUnlock.testRecursiveLockUnlock 100 avgt 4 944,968 ± 57,425 ns/op
LockUnlock.testRecursiveLockUnlockLocal 100 avgt 4 3633,174 ± 66,667 ns/op
LockUnlock.testRecursiveSynchronization 100 avgt 4 88,720 ± 0,754 ns/op
LockUnlock.testSerialLockUnlock 100 avgt 4 1720,471 ± 58,517 ns/op
LockUnlock.testSerialLockUnlockLocal 100 avgt 4 885,344 ± 39,917 ns/op
LockUnlock.testSimpleLockUnlock 100 avgt 4 864,052 ± 35,072 ns/op
LockUnlock.testSimpleLockUnlockLocal 100 avgt 4 879,373 ± 5,401 ns/op
</pre>
</details>
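To compare runs, the scores above can be turned into ratios. The helper below is a small sketch (the class `ScoreRatio` and its methods are made up for illustration) that parses the comma-decimal scores as printed and divides one by another; the two scores used are the testRecursiveLockUnlock rows for Base Lightweight +UseLSE and Recursive Lightweight +UseLSE.

```java
final class ScoreRatio {
    // The tables use a decimal comma, so normalize before parsing.
    static double parseScore(String score) {
        return Double.parseDouble(score.replace(',', '.'));
    }

    // Ratio of a baseline score to a candidate score (both in ns/op);
    // values above 1.0 mean the candidate is faster.
    static double ratio(String baseline, String candidate) {
        return parseScore(baseline) / parseScore(candidate);
    }

    public static void main(String[] args) {
        // testRecursiveLockUnlock: Base Lightweight +UseLSE vs Recursive Lightweight +UseLSE
        System.out.printf("%.2fx%n", ratio("3129,174", "885,525"));
    }
}
```

For that pair the ratio comes out at roughly 3.5x, which ignores the reported error margins and the small sample count (Cnt 4), so it should be read as indicative only.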
I agree that this should be commented, and probably tracked somewhere in JBS.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16608#issuecomment-1814306544