RFR: 8336640: Shenandoah: Parallel worker use in parallel_heap_region_iterate

Wed Jul 24 19:10:45 UTC 2024

[parallel_heap_region_iterate](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L1726-L1734) is used to execute lightweight operations on heap regions, including ShenandoahPrepareForMarkClosure, ShenandoahInitMarkUpdateRegionStateClosure, ShenandoahFinalUpdateRefsUpdateRegionStateClosure, ShenandoahResetUpdateRegionStateClosure and ShenandoahFinalMarkUpdateRegionStateClosure. Because all of these operations are very lightweight, parallelism appears to be overkill in the regular case, where the number of heap regions is not large: the cost of orchestrating multiple threads can exceed the cost of the work itself, so in most cases a single thread should be more efficient. When multi-threading is needed, we should maximize the utilization of all active workers for best performance.

This PR proposes improvements that address these issues:
1. Change the default value of ShenandoahParallelRegionStride to 0. When it is 0, Shenandoah automatically derives the stride for best performance.
2. If num_regions is <= 4096, do not use worker threads at all, to avoid the overhead of multi-threading.
3. If num_regions is more than 4096, use worker threads to parallelize the workload, and derive the stride so that the workload is distributed evenly across all active workers.
4. If the number of active workers is 1, do not dispatch to the workers; it is faster to finish the workload in the current thread (this avoids the overhead of multi-thread orchestration).
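
The decision logic in the list above can be sketched roughly as follows. This is a standalone illustration, not the code in the PR: `IterationPlan`, `plan_region_iteration` and `kSingleThreadThreshold` are hypothetical names, and rounding the derived stride up (ceiling division) is an assumption about how "evenly distribute" might be done.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; this is not the HotSpot source.
struct IterationPlan {
  bool use_workers;    // false: iterate in the calling thread
  std::size_t stride;  // regions per chunk when workers are used
};

// Threshold taken from the PR description (items 2 and 3).
constexpr std::size_t kSingleThreadThreshold = 4096;

IterationPlan plan_region_iteration(std::size_t num_regions,
                                    unsigned active_workers,
                                    std::size_t configured_stride) {
  assert(active_workers >= 1);
  std::size_t stride = configured_stride;
  if (stride == 0) {
    // Stride not set by the user: derive it automatically.
    if (num_regions <= kSingleThreadThreshold || active_workers == 1) {
      return {false, num_regions};  // small heap or single worker
    }
    // Assumed rounding: ceiling division, so every worker gets one chunk.
    stride = (num_regions + active_workers - 1) / active_workers;
  }
  // With only one chunk or one worker there is nothing to parallelize.
  bool use_workers = active_workers > 1 && num_regions > stride;
  return {use_workers, stride};
}
```

Under these assumptions, 8192 regions with 8 active workers and the default stride of 0 would yield a stride of 1024, while 4096 regions would stay on the calling thread.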

Here are some timings I collected from a test with the tip version (I added time metrics for parallel_heap_region_iterate):

JVM args: export JAVA_OPTS="-Xms8G -Xmx8G  -XX:+AlwaysPreTouch -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:ShenandoahParallelRegionStride=<stride> -XX:ShenandoahTargetNumRegions=<num_regions>  -Xlog:gc*"

|              | 1024 regions | 2048 regions | 4096 regions | 8192 regions | 16384 regions |
| ------------ | ------------ | ------------ | ------------ | ------------ | ------------- |
| 1024 stride  | 5785 ns      | 22194 ns     | 20953 ns     | 23008 ns     | 33013 ns      |
| 2048 stride  | N/A          | 6491 ns      | 22476 ns     | 25842 ns     | 34378 ns      |
| 4096 stride  | N/A          | N/A          | 14034 ns     | 28425 ns     | 36324 ns      |
| 8192 stride  | N/A          | N/A          | N/A          | 24359 ns     | 45231 ns      |
| 16384 stride | N/A          | N/A          | N/A          | N/A          | 53679 ns      |

Basically, when we increase the stride, fewer threads are used for parallel iteration and latency gets worse, which is expected. When the number of regions equals the stride, multi-threading is not used; using a single thread to process 4096 regions (14034 ns) is much better than using 4 threads with a 1024 stride (20953 ns).
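
To make the relation between stride and thread count concrete, here is a minimal sketch. It assumes that workers claim regions in stride-sized chunks, so the number of chunks bounds the number of workers that can do useful work; `chunk_count` and `max_busy_workers` are hypothetical helpers, not HotSpot functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of stride-sized chunks covering all regions.
std::size_t chunk_count(std::size_t num_regions, std::size_t stride) {
  assert(stride > 0);
  return (num_regions + stride - 1) / stride;  // ceiling division
}

// Hypothetical helper: upper bound on workers that can be busy at once,
// assuming each worker claims one chunk at a time.
std::size_t max_busy_workers(std::size_t num_regions, std::size_t stride,
                             std::size_t active_workers) {
  return std::min(chunk_count(num_regions, stride), active_workers);
}
```

For example, 4096 regions with a 1024 stride give 4 chunks, while 16384 regions with an 8192 stride give only 2, so at most two workers share the work however many are active.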

For the PR, I also tested with the following JVM args:
export JAVA_OPTS="-Xms8G -Xmx8G  -XX:+AlwaysPreTouch -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions  -XX:ShenandoahTargetNumRegions=<num_regions>  -Xlog:gc*"

| Regions | time (ns) |
| ------- | --------- |
| 1024    | 5103      |
| 2048    | 6132      |
| 4096    | 12763     |
| 8192    | 24295     |
| 16384   | 33729     |

Overall, with the PR the time for each region count is at or near the best time measured with any explicit stride in the first table.

Additional test:
- [ ] `make test TEST=hotspot_gc_shenandoah`

-------------

Commit messages:
 - Add empty line
 - clean
 - Fix build error on Windows
 - Revert "Add timing logs for execution of ShenandoahHeapRegionClosure"
 - Remove the default arg value to constructor of ShenandoahParallelHeapRegionTask
 - Auto derive stride for ShenandoahParallelHeapRegionTask when ShenandoahParallelRegionStride is set to 0
 - Dynamic calculate stride
 - Add timing logs for execution of ShenandoahHeapRegionClosure
 - Recalibrate ShenandoahParallelRegionStride value if there is no override from JVM args

Changes: https://git.openjdk.org/jdk/pull/20305/files
  Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20305&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8336640
  Stats: 21 lines in 2 files changed: 14 ins; 0 del; 7 mod
  Patch: https://git.openjdk.org/jdk/pull/20305.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/20305/head:pull/20305

PR: https://git.openjdk.org/jdk/pull/20305