RFR: 8331411: Shenandoah: Reconsider spinning duration in ShenandoahLock

Xiaolong Peng duke at openjdk.org
Thu Jun 13 18:40:44 UTC 2024


### Notes
While doing the CAS to acquire the lock, the original implementation sleeps/yields once after spinning 0xFFF times and repeats this until the lock is acquired, i.e. a `(N spins + sleep/yield) loop`. Based on test results, more spinning appears to deliver worse performance, so we changed the algorithm to `(N spins) + (yield loop)`, and additionally block the thread immediately if a safepoint is pending. We still needed to determine the best value of N for the spin phase, so we tested several candidates: 0 (no spin), 0x01, 0x03, 0x07, 0x0F, 0x1F, 0x3F, 0x7F and 0xFF, and compared the results with the baseline data (original implementation).

We also noticed a regression in the DaCapo h2 benchmark; after a deep dive and some debugging we decided to let non-Java threads only spin, which slightly favors GC threads over Java threads on a contended lock. Follow-up tasks are planned to reduce lock contention from Shenandoah GC itself, e.g. https://bugs.openjdk.org/browse/JDK-8334147
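For illustration only, here is a minimal Java sketch of the `(N spins) + (yield loop)` shape described above. The actual change is to the C++ ShenandoahLock; the class name, the `SPIN_LIMIT` constant and the `isJavaThread` parameter below are made up for this sketch and are not part of the patch.

```
import java.util.concurrent.atomic.AtomicBoolean;

// Minimal sketch of the contended-lock shape: a bounded TTAS spin phase,
// then a yield loop. Not the HotSpot code; names here are illustrative only.
final class SpinThenYieldLock {
    private static final int SPIN_LIMIT = 0x1F; // bounded spin phase (the value chosen below)

    private final AtomicBoolean locked = new AtomicBoolean(false);

    void lock(boolean isJavaThread) {
        // Phase 1: bounded test-and-test-and-set spin.
        for (int i = 0; i < SPIN_LIMIT; i++) {
            if (!locked.get() && locked.compareAndSet(false, true)) {
                return;
            }
            Thread.onSpinWait();
        }
        // Phase 2: yield loop. The real implementation additionally blocks a
        // Java thread right away if a safepoint is pending, while non-Java
        // (GC) threads keep spinning instead of yielding.
        while (true) {
            if (!locked.get() && locked.compareAndSet(false, true)) {
                return;
            }
            if (isJavaThread) {
                Thread.yield();
            } else {
                Thread.onSpinWait();
            }
        }
    }

    void unlock() {
        locked.set(false);
    }
}
```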

#### Test code

```
public class Alloc {
    static final int THREADS = 1280; // 32 threads per CPU core, 40 cores
    static final Object[] sinks = new Object[64 * THREADS];
    static volatile boolean start;
    static volatile boolean stop;

    public static void main(String... args) throws Throwable {
        for (int t = 0; t < THREADS; t++) {
            int ft = t;
            new Thread(() -> work(ft * 64)).start();
        }

        Thread.sleep(1000);
        start = true;
        Thread.sleep(30_000);
        stop = true;
    }

    public static void work(int idx) {
        while (!start) { Thread.onSpinWait(); }
        while (!stop) {
            sinks[idx] = new byte[128];
        }
    }
}
```


Run it like this and observe the TTSP times (`-XX:-UseTLAB` forces every allocation through the shared allocation path, which takes the ShenandoahLock):

```
java -Xms256m -Xmx256m -XX:+UseShenandoahGC -XX:-UseTLAB -Xlog:gc -Xlog:safepoint Alloc.java
```
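To turn the safepoint log into the Count/AVG/MAX/MIN numbers in the tables below, a throwaway snippet like the following can be used. It assumes the run's output was redirected to a file and that the `-Xlog:safepoint` lines contain `Reaching safepoint: <n> ns` (the TTSP component); the class name and regex are just for this sketch, so adjust them to whatever your JDK build actually prints.

```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LongSummaryStatistics;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: aggregates the "Reaching safepoint" (TTSP) values
// from a captured -Xlog:safepoint log file passed as the first argument.
public class TtspStats {
    private static final Pattern TTSP = Pattern.compile("Reaching safepoint: (\\d+) ns");

    public static void main(String[] args) throws IOException {
        LongSummaryStatistics stats;
        try (var lines = Files.lines(Path.of(args[0]))) {
            stats = lines.map(TTSP::matcher)
                         .filter(Matcher::find)
                         .mapToLong(m -> Long.parseLong(m.group(1)))
                         .summaryStatistics();
        }
        System.out.printf("count=%d avg=%.0f max=%d min=%d%n",
                stats.getCount(), stats.getAverage(), stats.getMax(), stats.getMin());
    }
}
```

Usage: `java TtspStats safepoint.log`.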


#### Metrics from tests (TTSP in ns, as reported by `-Xlog:safepoint`; allocation rate in M/s)
##### Heavy contention (1280 threads, 32 per CPU core)
| Test       | Count | AVG    | TRIMMEAN 2% | MAX      | MIN   | AVG allocation rate (M/s) |
| ---------- | ----- | ------ | ----------- | -------- | ----- | ------------------------ |
| Baseline   | 19    | 940270 | 940270      | 5956928  | 75562 | 23.34                    |
| No spin    | 164   | 222958 | 204822      | 3330053  | 53819 | 238.9                    |
| 0x01 | 172   | 189173 | 186601      | 750715   | 64864 | 244.1                    |
| 0x03      | 174   | 286892 | 217739      | 12412225 | 55891 | 239.5                    |
| 0x07       | 187   | 194440 | 183894      | 2284615  | 55256 | 235.9                    |
| 0x0F       | 182   | 657904 | 354898      | 55801594 | 55239 | 232.6                    |
| 0x1F       | 159   | 230107 | 213839      | 2973310  | 41083 | 212.3                    |
| 0x3F       | 139   | 284702 | 272033      | 2258179  | 46941 | 197                      |
| 0x7F       | 171   | 511573 | 280797      | 39980104 | 44162 | 197.53                   |
| 0xFF       | 158   | 608347 | 348863      | 41647457 | 48753 | 175                      |
##### Light contention (40 threads, 1 per CPU core)
| Test       | Count | AVG    | TRIMMEAN 2% | MAX     | MIN  | AVG allocation rate (M/s) |
| ---------- | ----- | ------ | ----------- | ------- | ---- | -------------------- |
| Baseline   | 86    | 188536 | 188536      | 3159699 | 5072 | 164.7249794          |
| No spin    | 128   | 23431  | 66482       | 128061  | 4332 | 258.7864942          |
| 0x01 | 132   | 28958  | 24035       | 689788  | 3164 | 262.067242           |
| 0x03       | 128   | 47344  | 22843       | 3175792 | 5932 | 259.4006259          |
| 0x07       | 125   | 21319  | 20566       | 128729  | 6494 | 252.633891           |
| 0x0F       | 125   | 36923  | 20528       | 2084257 | 6245 | 255.295953           |
| 0x1F       | 119   | 25430  | 23788       | 236583  | 6383 | 241.1196582          |
| 0x3F       | 116   | 34549  | 29085       | 685559  | 6532 | 230.5788333          |
| 0x7F       | 104   | 50870  | 49425       | 243020  | 6119 | 203.4185628          |
| 0xFF       | 95    | 101771 | 101771      | 2081392 | 2551 | 188.4154766          |

Overall, the results show that less spinning delivers better performance in terms of both TTSP and allocation rate. However, spinning only once or not at all seems too aggressive, so to stay conservative on the spin count we decided to use 0x1F.

-------------

Commit messages:
 - Merge branch 'openjdk:master' into JDK-8331411
 - minor adjustments
 - Minor fix
 - Merge branch 'openjdk:master' into JDK-8331411
 - Typo
 - Bug fix
 - Revert the fix specifically h2 and polish the code
 - Yield up to 5 times
 - Apply TTAS for fast lock
 - Remove trailing whitespace
 - ... and 5 more: https://git.openjdk.org/jdk/compare/cff048c7...d5d8b65f

Changes: https://git.openjdk.org/jdk/pull/19570/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=19570&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8331411
  Stats: 49 lines in 2 files changed: 20 ins; 17 del; 12 mod
  Patch: https://git.openjdk.org/jdk/pull/19570.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/19570/head:pull/19570

PR: https://git.openjdk.org/jdk/pull/19570

