RFR: 8335356: Shenandoah: Improve concurrent cleanup locking

Wed Jul 10 20:07:44 UTC 2024

On Mon, 8 Jul 2024 21:38:41 GMT, Xiaolong Peng <xpeng at openjdk.org> wrote:

> Hi all,
>    This PR is to improve the usage of heap lock in ShenandoahFreeSet::recycle_trash, the original observation mentioned in the [bug](https://bugs.openjdk.org/browse/JDK-8335356) should be caused by an uncommitted/reverted change I added when Aleksey and I worked on [JDK-8331411](https://bugs.openjdk.org/browse/JDK-8331411). Even the change was not committed, the way ShenandoahFreeSet::recycle_trash using heap lock is still not efficient, we think we should improve it. 
>     With the logs added in this commit https://github.com/openjdk/jdk/pull/20086/commits/5688ee2c0754483818a89bb7915f58a7464c2df2, I got some key metrics: average time to acquire heap lock is about 450 ~ 900 ns, average time to recycle one trash regions is about 600ns (if not batched, it is 1000+ns, might be related to branch prediction). The current implementation takes heap lock once for every trash region, assume there are 1000 regions to recycle, the time wasted on acquiring heap lock is more than 0.6ms.
>     
>    The PR splits the recycling process into two steps: 1. Filter out all the trash regions; 2. recycle the trash regions in batches. I can see some benefits which improve the performance:
>     1. Less time spent on acquiring heap lock, less contention with mutators/allocators
>     2.Simpler loops in filtering and batch recycling, presumably benefit CPU branch prediction
>    
> Here are some logs from test running h2 benchmark:
> 
> TIP with debug log, [code link](https://github.com/openjdk/jdk/compare/master...pengxiaolong:jdk:JDK-8335356-baseline?expand=1), Average time per region: 2312 ns
> 
> [6.013s][info][gc] GC(0) Recycled 0 regions in 58675ns, break down: acquiring lock -> 0, recycling -> 0.
> [6.093s][info][gc] GC(0) Recycled 641 regions in 3025757ns, break down: acquiring lock -> 260016, recycling -> 548345.
> [9.354s][info][gc] GC(1) Recycled 1 regions in 61793ns, break down: acquiring lock -> 481, recycling -> 1141.
> [9.428s][info][gc] GC(1) Recycled 600 regions in 1083206ns, break down: acquiring lock -> 256578, recycling -> 511334.
> [12.145s][info][gc] GC(2) Recycled 35 regions in 118390ns, break down: acquiring lock -> 13703, recycling -> 27438.
> [12.202s][info][gc] GC(2) Recycled 553 regions in 911747ns, break down: acquiring lock -> 209511, recycling -> 426575.
> [15.086s][info][gc] GC(3) Recycled 106 regions in 218396ns, break down: acquiring lock -> 39089, recycling -> 80520.
> [15.164s][info][gc] GC(3) Recycled 454 regions in 762128ns, break down: acquiring lock -> 172583, recycl...

Marked as reviewed by ysr (Reviewer).

Impressive and a nice demonstration of the improvements! Benchmarking with HyperAlloc may also be useful or even just SPECjbb may show some non-linear improvements, who knows? May be worth measuring, perhaps? Running the count-based and time-based on a (slow,fast) x (arm,x86) system to fill the matrix would be great, but may be more effort than worthwhile, but just putting it out there. Good data of actual measured improvements always makes me happy, though!  :-) 

Thanks for the extra effort in collecting the data and sharing it.

Reviewed and approved, thank you!

-------------

PR Review: https://git.openjdk.org/jdk/pull/20086#pullrequestreview-2169690869
PR Comment: https://git.openjdk.org/jdk/pull/20086#issuecomment-2220964886