RFR: 8372861: Genshen: Override parallel_region_stride of ShenandoahResetBitmapClosure to a reasonable value for better parallelism

William Kemper wkemper at openjdk.org
Tue Dec 2 22:15:36 UTC 2025


On Tue, 2 Dec 2025 21:40:29 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

>> The intention of the change is to let `ShenandoahResetBitmapClosure` stop using the global ShenandoahParallelRegionStride value entirely, for the following reasons:
>> ShenandoahParallelRegionStride is usually set to a large value (the default used to be 1024), and a large value does not help the performance of a ShenandoahHeapRegionClosure at all:
>> 1. ShenandoahResetBitmapClosure resets the marking bitmaps before/after a GC cycle, and the reset may not be needed for every region, e.g. when `top_bitmap == bottom` (immediate trash regions?) or when the region is not in the current GC generation.
>> 2. With a large ShenandoahParallelRegionStride, each task gets a large run of consecutive regions, e.g. worker 0 processes regions 1 to 1024. This makes it impossible to ensure the actual workload is evenly distributed across the workers: some workers may get most of the regions that need a bitmap reset, while others may do no actual reset work at all.
>> 
>> A smaller parallel region stride helps with workload distribution and also adapts to different numbers of workers; it should also work just fine with "worker surge". A simplified sketch of the idea follows below.
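>> 
>> Sketch of the idea (simplified, not the exact patch; the name and signature of the stride hook, and the value returned, are assumptions based on the PR title):
>> 
>>   class ShenandoahResetBitmapClosure : public ShenandoahHeapRegionClosure {
>>   public:
>>     void heap_region_do(ShenandoahHeapRegion* region) override {
>>       // Skip regions that need no work (nothing was ever marked, or the
>>       // region is outside the current GC generation); clear the bitmap
>>       // words for the remaining regions.
>>     }
>>     // Hand out small batches of regions so that the few expensive
>>     // regions are spread across workers instead of clustered onto one.
>>     // (The actual value chosen in the PR may differ.)
>>     size_t parallel_region_stride() const override { return 1; }
>>   };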
>
> In the JBS bug report, I attached a test I did for this; I tested stride values from 1 to 4096:
> 
> java -XX:+TieredCompilation -XX:+AlwaysPreTouch -Xms32G -Xmx32G -XX:+UseShenandoahGC -XX:+UnlockExperimentalVMOptions -XX:+UnlockDiagnosticVMOptions -Xlog:gc* -XX:-ShenandoahUncommit -XX:ShenandoahGCMode=generational -XX:+UseTLAB -XX:ShenandoahParallelRegionStride=<Stride Value> -jar ~/Downloads/dacapo-23.11-MR2-chopin.jar -n 5 h2 | grep "Concurrent Reset"
> 
> [1]
> [77.444s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3078 us) (n = 14) (lvls, us = 1172, 1289, 1328, 1406, 14780)
> [77.444s][info][gc,stats ] Concurrent Reset After Collect = 0.044 s (a = 3150 us) (n = 14) (lvls, us = 1074, 1504, 1895, 4121, 8952)
> 
> [2]
> [77.304s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3036 us) (n = 14) (lvls, us = 1152, 1211, 1289, 1328, 14872)
> [77.305s][info][gc,stats ] Concurrent Reset After Collect = 0.046 s (a = 3297 us) (n = 14) (lvls, us = 939, 1602, 2148, 3945, 8744)
> 
> [4]
> [76.898s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3048 us) (n = 14) (lvls, us = 1152, 1230, 1270, 1328, 14989)
> [76.898s][info][gc,stats ] Concurrent Reset After Collect = 0.045 s (a = 3215 us) (n = 14) (lvls, us = 1016, 1309, 1914, 3301, 7076)
> 
> [8]
> [77.916s][info][gc,stats ] Concurrent Reset = 0.043 s (a = 3067 us) (n = 14) (lvls, us = 1152, 1211, 1270, 1309, 15091)
> [77.916s][info][gc,stats ] Concurrent Reset After Collect = 0.043 s (a = 3050 us) (n = 14) (lvls, us = 1133, 1484, 1934, 3086, 8113)
> 
> [16]
> [77.071s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3019 us) (n = 14) (lvls, us = 1152, 1250, 1270, 1328, 14615)
> [77.071s][info][gc,stats ] Concurrent Reset After Collect = 0.046 s (a = 3284 us) (n = 14) (lvls, us = 932, 1523, 2090, 2930, 8841)
> 
> [32]
> [76.965s][info][gc,stats ] Concurrent Reset = 0.044 s (a = 3117 us) (n = 14) (lvls, us = 1191, 1211, 1328, 1348, 14768)
> [76.965s][info][gc,stats ] Concurrent Reset After Collect = 0.047 s (a = 3323 us) (n = 14) (lvls, us = 930, 1406, 1875, 4316, 8565)
> 
> [64]
> [77.255s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3033 us) (n = 14) (lvls, us = 1152, 1211, 1270, 1406, 14635)
> [77.255s][info][gc,stats ] Concurrent Reset After Collect = 0.054 s (a = 3862 us) (n = 14) (lvls, us = 1133, 1504, 2852, 5508, 8947)
> 
> [128]
> [76.502s][info][gc,stats ] Concurrent Reset = 0.042 s (a = 3027 us) (n = 14) (lvls, us = 1133, 1230, 1250, 1426, 14264)
> [76.502s][info][gc,stats ] Concurrent Reset After Collect = 0.053 s (a = 3762 us) (n = 14) (lvls, us = 1172, 15...

Maybe amend the comment to explain that using a smaller value yields better task distribution for a lumpy workload like resetting bitmaps?
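
To make the "lumpy workload" point concrete, here is a small standalone C++ simulation (toy code, not from HotSpot; all names and numbers are invented): only a clustered subset of regions has real work to do, and workers claim batches of `stride` regions from a shared cursor.

    // Toy model: 4096 regions, but only the first 512 need a bitmap reset.
    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <cstdio>
    #include <thread>
    #include <vector>

    int main() {
      const size_t num_regions = 4096;
      const size_t num_workers = 8;
      std::vector<int> cost(num_regions, 0);
      for (size_t i = 0; i < 512; i++) cost[i] = 1;  // the expensive cluster

      for (size_t stride : {1024, 64, 4}) {
        std::atomic<size_t> cursor{0};
        std::vector<long> done(num_workers, 0);
        std::vector<std::thread> workers;
        for (size_t w = 0; w < num_workers; w++) {
          workers.emplace_back([&, w] {
            for (;;) {
              size_t start = cursor.fetch_add(stride);  // claim next batch
              if (start >= num_regions) break;
              size_t end = std::min(start + stride, num_regions);
              for (size_t i = start; i < end; i++) done[w] += cost[i];
            }
          });
        }
        for (auto& t : workers) t.join();
        printf("stride %4zu: busiest worker did %ld of 512 units\n",
               stride, *std::max_element(done.begin(), done.end()));
      }
      return 0;
    }

With stride 1024 the entire expensive cluster fits inside one batch, so a single worker does all 512 units of work no matter how the threads are scheduled; with smaller strides the cluster spans many batches that different workers can claim, which is the distribution effect described above.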

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/28613#discussion_r2582942471

