RFR: 8335356: Shenandoah: Improve concurrent cleanup locking

Thu Jul 11 06:22:56 UTC 2024

On Mon, 8 Jul 2024 21:38:41 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> Hi all,
>    This PR is to improve the usage of heap lock in ShenandoahFreeSet::recycle_trash, the original observation mentioned in the [bug](https://bugs.openjdk.org/browse/JDK-8335356) should be caused by an uncommitted/reverted change I added when Aleksey and I worked on [JDK-8331411](https://bugs.openjdk.org/browse/JDK-8331411). Even the change was not committed, the way ShenandoahFreeSet::recycle_trash using heap lock is still not efficient, we think we should improve it. 
>     With the logs added in this commit https://github.com/openjdk/jdk/pull/20086/commits/5688ee2c0754483818a89bb7915f58a7464c2df2, I got some key metrics: average time to acquire heap lock is about 450 ~ 900 ns, average time to recycle one trash regions is about 600ns (if not batched, it is 1000+ns, might be related to branch prediction). The current implementation takes heap lock once for every trash region, assume there are 1000 regions to recycle, the time wasted on acquiring heap lock is more than 0.6ms.
>     
>    The PR splits the recycling process into two steps: 1. Filter out all the trash regions; 2. recycle the trash regions in batches. I can see some benefits which improve the performance:
>     1. Less time spent on acquiring heap lock, less contention with mutators/allocators
>     2.Simpler loops in filtering and batch recycling, presumably benefit CPU branch prediction
>    
> Here are some logs from test running h2 benchmark:
> 
> TIP with debug log, [code link](https://github.com/openjdk/jdk/compare/master...pengxiaolong:jdk:JDK-8335356-baseline?expand=1), Average time per region: 2312 ns
> 
> [6.013s][info][gc] GC(0) Recycled 0 regions in 58675ns, break down: acquiring lock -> 0, recycling -> 0.
> [6.093s][info][gc] GC(0) Recycled 641 regions in 3025757ns, break down: acquiring lock -> 260016, recycling -> 548345.
> [9.354s][info][gc] GC(1) Recycled 1 regions in 61793ns, break down: acquiring lock -> 481, recycling -> 1141.
> [9.428s][info][gc] GC(1) Recycled 600 regions in 1083206ns, break down: acquiring lock -> 256578, recycling -> 511334.
> [12.145s][info][gc] GC(2) Recycled 35 regions in 118390ns, break down: acquiring lock -> 13703, recycling -> 27438.
> [12.202s][info][gc] GC(2) Recycled 553 regions in 911747ns, break down: acquiring lock -> 209511, recycling -> 426575.
> [15.086s][info][gc] GC(3) Recycled 106 regions in 218396ns, break down: acquiring lock -> 39089, recycling -> 80520.
> [15.164s][info][gc] GC(3) Recycled 454 regions in 762128ns, break down: acquiring lock -> 172583, recycl...

Based on Aleksey's benchmark, I wrote a very [simple benchmark](https://github.com/pengxiaolong/benchmarks/blob/main/allocation-latency/src/main/java/personal/xlpeng/benchmarks/allocationlatency/AllocationLatency.java) to generate HdrHistogram, run command like below to generate HdrHistogram metrics:

export JAVA_HOME=/home/xlpeng/repos/jdk-xlpeng/optimized-timed-lock
export JAVA_OPTS="-Xms30g -Xmx30g -XX:+AlwaysPreTouch -XX:+UseShenandoahGC"
./build/distributions/allocation-latency/bin/allocation-latency
export JAVA_HOME=/home/xlpeng/repos/jdk-xlpeng/baseline
./build/distributions/allocation-latency/bin/allocation-latency

Here is the HdrHistogram:

![Histogram](https://github.com/openjdk/jdk/assets/2170530/fe52c38e-6c30-442c-b4c8-2c7ee06c5278)

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20086#issuecomment-2222121679