RFR: 8336640: Shenandoah: Parallel worker use in parallel_heap_region_iterate

Wed Jul 24 19:10:46 UTC 2024

On Wed, 24 Jul 2024 00:42:22 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> [parallel_heap_region_iterate](https://github.com/openjdk/jdk/blob/master/src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp#L1726-L1734) is used to execute lightweight operations on heap regions, including ShenandoahPrepareForMarkClosure, ShenandoahInitMarkUpdateRegionStateClosure, ShenandoahFinalUpdateRefsUpdateRegionStateClosure, ShenandoahResetUpdateRegionStateClosure and ShenandoahFinalMarkUpdateRegionStateClosure. Since all of these operations are very lightweight, in regular cases without a large number of heap regions the parallelism seems to be overkill, because orchestrating multiple threads could be more expensive than the work itself; in most cases, a single thread should be more efficient. Also, if multi-threading is needed, we should maximize the utilization of all active workers for best performance.
> 
> This PR includes proposed improvements addressing the known issues:
> 1. Change the default value of ShenandoahParallelRegionStride to 0; when it is 0, Shenandoah automatically derives the stride for best performance.
> 2. If num_regions is <= 4096, do not use worker threads at all, to avoid the overhead of multi-threading.
> 3. When num_regions is more than 4096, use worker threads to parallelize the workload, and derive the stride so that the workload is distributed evenly across all active workers.
> 4. When the number of active workers is 1, don't bother the workers; it is faster to finish the workload in the current thread (avoiding the overhead of multi-thread orchestration).
> 
> Here are some time metrics I collected from a test with the TIP version (I added time metrics for parallel_heap_region_iterate):
> 
> JVM args: export JAVA_OPTS="-Xms8G -Xmx8G  -XX:+AlwaysPreTouch -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:ShenandoahParallelRegionStride=<stride> -XX:ShenandoahTargetNumRegions=<num_regions>  -Xlog:gc*"
> 
> |              | 1024 regions | 2048 regions | 4096 regions | 8192 regions | 16384 regions |
> | ------------ | ------------ | ------------ | ------------ | ------------ | ------------- |
> | 1024 stride  | 5785 ns      | 22194 ns     | 20953 ns     | 23008 ns     | 33013 ns      |
> | 2048 stride  | N/A          | 6491 ns      | 22476 ns     | 25842 ns     | 34378 ns      |
> | 4096 stride  | N/A          | N/A          | 14034 ns     | 28425 ns     | 36324 ns      |
> | 8192 stride  | N/A          | N/A          | N/A          | 24359 ns     | 45231 ns      |
> | 16384 stride | N/A          | N/A          | N/A          | N/A          | 53679 ns      |
> 
> Basically w...

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1697:

> 1695:   ShenandoahHeap* const _heap;
> 1696:   ShenandoahHeapRegionClosure* const _blk;
> 1697:   size_t _stride;

Should be `size_t const _stride;`?
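
To illustrate the point, here is a minimal stand-alone sketch, assuming the stride is fixed when the task is constructed. The class name `RegionTaskSketch` is hypothetical and is not the actual task class in shenandoahHeap.cpp; it only shows that a const member has to be set in the constructor's initializer list and cannot be reassigned afterwards.

```cpp
#include <cstddef>

// Hypothetical stand-in for the task class under review (not HotSpot code).
class RegionTaskSketch {
  // const: the stride is decided once, before any worker runs.
  const size_t _stride;

public:
  // A const member must be initialized in the initializer list.
  explicit RegionTaskSketch(size_t stride) : _stride(stride) {}

  size_t stride() const { return _stride; }
};
```

With the member declared this way, an accidental later assignment to `_stride` would be rejected at compile time.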

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1729:

> 1727: void ShenandoahHeap::parallel_heap_region_iterate(ShenandoahHeapRegionClosure* blk) const {
> 1728:   assert(blk->is_thread_safe(), "Only thread-safe closures here");
> 1729:   const uint active_workers = workers() -> active_workers();

Suggestion:

  const uint active_workers = workers()->active_workers();

src/hotspot/share/gc/shenandoah/shenandoahHeap.cpp line 1737:

> 1735:     // not use worker threads to avoid the overhead; otherwise calculate the stride by num_regions/active_workers
> 1736:     // to make sure every worker thread will have same amount of workload.
> 1737:     stride = n_regions <= 4096 ? 4096 : checked_cast<size_t>(ceil(checked_cast<float>(n_regions) / checked_cast<float>(active_workers)));

I suggest writing it like this:

  size_t stride = ShenandoahParallelRegionStride;

  if (stride == 0 && active_workers > 1) {
    // Automatically derive the stride to balance the work between threads
    // evenly. Do not try to split work if below the reasonable threshold.
    const size_t threshold = 4096;
    stride = (n_regions <= threshold) ?
            threshold :
            (n_regions + active_workers - 1) / active_workers;
  }

  if (n_regions > stride && active_workers > 1) {
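
For reference, the logic above can be exercised outside of HotSpot with a small stand-alone sketch. The helper names `derive_stride` and `should_use_workers` are hypothetical, and the parameters stand in for ShenandoahParallelRegionStride, the region count, and the active worker count; this is an illustration under those assumptions, not the code in the PR.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the suggested stride derivation.
// configured_stride stands in for ShenandoahParallelRegionStride (0 = auto).
size_t derive_stride(size_t configured_stride, size_t n_regions,
                     unsigned active_workers) {
  size_t stride = configured_stride;
  if (stride == 0 && active_workers > 1) {
    // Below the threshold, pick a stride that covers every region so that
    // the work is not split at all.
    const size_t threshold = 4096;
    stride = (n_regions <= threshold)
                 ? threshold
                 // Integer ceiling division: no float round trip needed.
                 : (n_regions + active_workers - 1) / active_workers;
  }
  return stride;
}

// Hypothetical helper mirroring the final condition: only go parallel when
// there is more than one stride of work and more than one worker.
bool should_use_workers(size_t stride, size_t n_regions,
                        unsigned active_workers) {
  return n_regions > stride && active_workers > 1;
}
```

For example, 8192 regions with 4 active workers gives a stride of 2048 and the parallel path, while 4096 regions gives a stride of 4096 and the work stays on the current thread. An explicitly configured non-zero stride is passed through unchanged.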

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/20305#discussion_r1690045161
PR Review Comment: https://git.openjdk.org/jdk/pull/20305#discussion_r1690045677
PR Review Comment: https://git.openjdk.org/jdk/pull/20305#discussion_r1690214436