RFR: 8319801: Recursive lightweight locking: aarch64 implementation

Axel Boldt-Christmas aboldtch at openjdk.org
Thu Nov 16 12:02:31 UTC 2023


On Wed, 15 Nov 2023 09:57:40 GMT, Andrew Haley <aph at openjdk.org> wrote:

> Hmm. Which hardware is this? This is stuff I need to be aware of. Please contact me off-line if it's hard to say in public.

This has been observed with different versions of the Apple M1 processors.

To clarify, when I say contention I am referring to Java monitor contention, that is, multiple threads trying to lock the same object.

The performance is particularly bad if the LSE CAS fails. This pattern is prevalent in the un-contended inflated recursive lock. In the current implementation this is still an issue, but as we are removing most of the common reasons why an un-contended lock gets inflated, we should not see this as often.

At some point we also had some code which improves this (e.g. https://github.com/xmas92/jdk/blob/3150426b261bfceacdceda1b2ebccd82b6e6fb41/src/hotspot/cpu/aarch64/c2_MacroAssembler_aarch64.cpp#L162-L167 ), but I did not want to also change the inflated lock / unlock paths in this PR.
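To illustrate why the un-contended inflated recursive lock keeps hitting a failing CAS, here is a minimal Java sketch. It is not the HotSpot code and not necessarily what the linked change does; `Monitor`, `enterCasFirst`, `enterCheckFirst` and `failedCasAfter` are hypothetical names, and the model assumes an inflated monitor is just an owner field plus a recursion count.

```java
import java.util.concurrent.atomic.AtomicReference;

public class InflatedEnterSketch {
    // Hypothetical stand-in for an inflated monitor.
    static final class Monitor {
        final AtomicReference<Thread> owner = new AtomicReference<>();
        int recursions; // only touched by the owning thread
        int failedCas;  // bookkeeping for this single-threaded illustration only
    }

    // CAS first: every recursive enter pays for one failing CAS
    // before it notices that the current thread already owns the monitor.
    static boolean enterCasFirst(Monitor m) {
        Thread self = Thread.currentThread();
        if (m.owner.compareAndSet(null, self)) {
            return true;
        }
        m.failedCas++;
        if (m.owner.get() == self) {
            m.recursions++;
            return true;
        }
        return false; // contended; the slow path is omitted
    }

    // Check first: a plain load detects the recursive case,
    // so the CAS is only attempted when the owner is not the current thread.
    static boolean enterCheckFirst(Monitor m) {
        Thread self = Thread.currentThread();
        if (m.owner.get() == self) {
            m.recursions++;
            return true;
        }
        if (m.owner.compareAndSet(null, self)) {
            return true;
        }
        m.failedCas++;
        return false; // contended; the slow path is omitted
    }

    // Enters the same monitor `depth` times from one thread and
    // returns how many CAS attempts failed along the way.
    static int failedCasAfter(boolean checkFirst, int depth) {
        Monitor m = new Monitor();
        for (int i = 0; i < depth; i++) {
            boolean ok = checkFirst ? enterCheckFirst(m) : enterCasFirst(m);
            if (!ok) {
                throw new IllegalStateException("unexpected contention");
            }
        }
        return m.failedCas;
    }

    public static void main(String[] args) {
        System.out.println("CAS first:   " + failedCasAfter(false, 100) + " failed CASes");
        System.out.println("check first: " + failedCasAfter(true, 100) + " failed CASes");
    }
}
```

In this model every enter after the first one fails its CAS in the CAS-first variant, while the check-first variant never issues a failing CAS on a single thread.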

We have also tried different recursive lightweight unlock paths, some where avoiding the LSE CAS has been more important. In the current PR it is less important, as we make decisions based on the state of the lock stack first. This avoids most of the cases of un-contended failing CASes that occur in the mainline implementation.
However, it still seemed to be more performant on this hardware to use an LL-SC pair.
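As a rough illustration of "decisions based on the state of the lock stack first", the following Java sketch models a per-thread lock stack and counts CAS attempts. It is a simplification under stated assumptions, not the aarch64 assembly from this PR: `LockStackSketch`, `Obj`, the `UNLOCKED`/`LOCKED` encoding and `casCountForDepth` are all hypothetical, and inflation and contention handling are left out.

```java
import java.util.ArrayList;
import java.util.concurrent.atomic.AtomicInteger;

public class LockStackSketch {
    // Hypothetical mark encoding; the real mark word layout is more involved.
    static final int UNLOCKED = 1;
    static final int LOCKED = 0;

    static final class Obj {
        final AtomicInteger mark = new AtomicInteger(UNLOCKED);
    }

    // Per-thread stack of objects currently locked by this thread.
    final ArrayList<Obj> lockStack = new ArrayList<>();
    int casCount; // number of CAS attempts, for illustration

    boolean lock(Obj o) {
        int n = lockStack.size();
        // Recursive case: the object is already on top of the lock stack, no CAS needed.
        if (n > 0 && lockStack.get(n - 1) == o) {
            lockStack.add(o);
            return true;
        }
        casCount++;
        if (o.mark.compareAndSet(UNLOCKED, LOCKED)) {
            lockStack.add(o);
            return true;
        }
        return false; // contended or inflated; the slow path is omitted
    }

    boolean unlock(Obj o) {
        int n = lockStack.size();
        if (n == 0 || lockStack.get(n - 1) != o) {
            return false; // not the top entry; the slow path is omitted
        }
        // Recursive case: the entry below the top is the same object, so only pop.
        if (n > 1 && lockStack.get(n - 2) == o) {
            lockStack.remove(n - 1);
            return true;
        }
        casCount++;
        if (o.mark.compareAndSet(LOCKED, UNLOCKED)) {
            lockStack.remove(n - 1);
            return true;
        }
        return false; // mark changed underneath us; the slow path is omitted
    }

    // Locks one object `depth` times, unlocks it `depth` times,
    // and returns the number of CAS attempts made.
    static int casCountForDepth(int depth) {
        LockStackSketch thread = new LockStackSketch();
        Obj o = new Obj();
        for (int i = 0; i < depth; i++) {
            if (!thread.lock(o)) throw new IllegalStateException("lock failed");
        }
        for (int i = 0; i < depth; i++) {
            if (!thread.unlock(o)) throw new IllegalStateException("unlock failed");
        }
        return thread.casCount;
    }

    public static void main(String[] args) {
        System.out.println("CAS attempts at depth 100: " + casCountForDepth(100));
    }
}
```

In this model only the outermost lock and unlock touch the mark with a CAS, and neither of them fails when there is no contention. Whether the remaining CAS is better issued as an LSE instruction or as an LL-SC pair is the hardware-dependent question the numbers below address.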

Here are some microbenchmarks running on an Apple M1 Pro chip. This is an extended version of the LockUnlock.java JMH micros (patch 3a7eb137140971f6b21ffea5dbf512300b38371a). It is extended because some of the tests never get compiled, as C2 bails out. Those tests are clearly identifiable in the results, as they are an order of magnitude worse.

<details>
  <summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Legacy -UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     77,003 ±   7,558  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   1280,276 ±  11,565  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  16525,732 ± 222,518  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4    602,364 ±  18,365  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8984,140 ± 389,655  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4   1804,546 ±  14,954  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4   3504,367 ±  48,076  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     40,477 ±  11,088  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   2275,810 ± 222,888  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4   1135,063 ±   9,118  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4   1130,178 ±  58,801  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4   1134,359 ±   8,701  ns/op
  </pre>
</details>
<details>
  <summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Legacy +UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score      Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     52,511 ±   24,029  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   2473,421 ±  117,243  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  22371,364 ± 1364,761  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4   1106,888 ±   26,179  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4  12081,724 ±  793,498  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4   3265,306 ±  214,527  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4   3586,900 ±  165,551  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     88,162 ±    3,763  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   1891,455 ±   67,336  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4    943,267 ±   39,638  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4    958,670 ±   24,282  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4    930,944 ±   13,019  ns/op
  </pre>
</details>
<details>
  <summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Lightweight -UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     51,767 ±   1,708  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   1320,017 ±  12,844  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  15297,789 ± 538,970  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4    599,823 ±  13,903  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8181,012 ± 266,438  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4   1285,344 ±   9,739  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4  15249,363 ±  59,621  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     33,060 ±   0,260  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   2550,867 ±  32,597  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4   1274,052 ±   6,240  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4   1286,234 ±  65,275  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4   1278,423 ±  11,065  ns/op
  </pre>
</details>
<details>
  <summary>Base(c80e691adf6f9ac1a41b2329ce366710e604e34e) Lightweight +UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     93,536 ±   2,062  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   2993,243 ±  49,181  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  16840,772 ±  86,835  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4   1949,685 ±  10,739  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8992,361 ±  42,743  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4   3129,174 ±  77,245  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4  16841,642 ± 237,059  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4    107,438 ±   5,077  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   2657,087 ±  65,913  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4   1328,323 ±  85,543  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4   1310,857 ±  17,261  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4   1311,644 ±  39,859  ns/op
  </pre>
</details>
<details>
  <summary>Recursive Lightweight(1e7a586c027b6c84f42f317381e6b35ebb45cea0) -UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     66,658 ±   4,420  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   1288,176 ±  14,966  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  15743,745 ± 293,414  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4    611,030 ±   8,646  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8273,894 ±  54,006  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4    885,686 ±   5,822  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4   3678,847 ±   6,472  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     38,393 ±   9,834  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   1653,768 ±  10,920  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4    829,223 ±   2,152  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4    830,576 ±  24,810  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4    835,194 ±  66,321  ns/op
  </pre>
</details>
<details>
  <summary>Recursive Lightweight(1e7a586c027b6c84f42f317381e6b35ebb45cea0) +UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     85,688 ±  17,538  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   2334,429 ±  70,698  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  15601,593 ± 480,278  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4   1065,708 ±  10,372  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8239,642 ±  98,829  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4    885,525 ±   1,831  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4   3647,819 ± 120,980  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     89,187 ±   0,787  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   1661,228 ±  24,971  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4    837,762 ±  43,297  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4    829,542 ±  11,918  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4    828,762 ±   3,844  ns/op
  </pre>
</details>
<details>
  <summary>Recursive Lightweight (+ Patch switch to CAS over LL-SC 8dbe0762b98c1427d1588795d77ea73e306d045d) -UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     94,994 ±  19,096  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   1258,710 ±   4,664  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  15381,962 ±  84,907  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4    597,632 ±   1,807  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8212,172 ± 125,500  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4    933,620 ±  45,059  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4   3631,726 ±  23,656  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     36,777 ±   0,349  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   1764,221 ±   6,173  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4    889,761 ±   1,720  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4    895,285 ±   9,457  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4    889,444 ±   5,734  ns/op
  </pre>
</details>
<details>
  <summary>Recursive Lightweight (+ Patch switch to CAS over LL-SC 8dbe0762b98c1427d1588795d77ea73e306d045d) +UseLSE</summary>
  <pre>
  Benchmark                                           (innerCount)  Mode  Cnt      Score     Error  Units
  LockUnlock.testContendedLock                                 100  avgt    4     74,835 ±   9,992  ns/op
  LockUnlock.testMonitorRecursiveLockUnlock                    100  avgt    4   2299,803 ±   6,954  ns/op
  LockUnlock.testMonitorRecursiveLockUnlockLocal               100  avgt    4  15452,039 ± 776,829  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlock                100  avgt    4   1067,769 ±   6,606  ns/op
  LockUnlock.testMonitorRecursiveOnlyLockUnlockLocal           100  avgt    4   8219,391 ±  46,559  ns/op
  LockUnlock.testRecursiveLockUnlock                           100  avgt    4    944,968 ±  57,425  ns/op
  LockUnlock.testRecursiveLockUnlockLocal                      100  avgt    4   3633,174 ±  66,667  ns/op
  LockUnlock.testRecursiveSynchronization                      100  avgt    4     88,720 ±   0,754  ns/op
  LockUnlock.testSerialLockUnlock                              100  avgt    4   1720,471 ±  58,517  ns/op
  LockUnlock.testSerialLockUnlockLocal                         100  avgt    4    885,344 ±  39,917  ns/op
  LockUnlock.testSimpleLockUnlock                              100  avgt    4    864,052 ±  35,072  ns/op
  LockUnlock.testSimpleLockUnlockLocal                         100  avgt    4    879,373 ±   5,401  ns/op
  </pre>
</details>

I agree that this should be commented, and probably tracked somewhere in JBS.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16608#issuecomment-1814306544

