RFR: 8327452: G1: Improve scalability of Merge Log Buffers

Wed Mar 6 18:31:45 UTC 2024

On Wed, 6 Mar 2024 11:18:03 GMT, Ivan Walulya <iwalulya at openjdk.org> wrote:

> Hi all,
> 
> Please review this change to reduce contention on the DCQS. The parallel work done by the worker threads is very small, therefore contention on the DCQS dominates the Merge Log Buffers phase. In this change, we add a sequential phase to distribute the Log Buffers to the worker threads. This removes the DCQS bottleneck in the highly contended case, at a small cost to the cases with low contention.
> 
> Testing Tier 1-3
> 
> The graphs below are using Bigramtester at 20gb, "Distribute and Avg Log Buffers" combines the duration of the sequential distribute phase and the avg duration of the parallel `Merge Log Buffers` phase.
> | Time (ms)  | # Cards |
> | ------------- | ------------- |
> | ![128_concurrent_start](https://github.com/openjdk/jdk/assets/69453999/c6663ed9-6b81-4e32-bb1e-17627591406f) | ![128_concurrent_start_cards](https://github.com/openjdk/jdk/assets/69453999/1909564b-2482-4767-a334-9f5612d5cb8c)|
> | ![128_mixed](https://github.com/openjdk/jdk/assets/69453999/d8dd8933-3aae-4795-b426-b4022bcf1ff2)| ![128_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/ed5b8279-8d70-4dc6-b53d-59bf04b6e12e)|
> | ![64_concurrent_start](https://github.com/openjdk/jdk/assets/69453999/5fb03cc5-2b86-4b9d-83ff-caa0eb482859) | ![64_concurrent_start_cards](https://github.com/openjdk/jdk/assets/69453999/c62fb05a-dca1-4b3f-895c-62837f7af2f8)|
> | ![64_mixed](https://github.com/openjdk/jdk/assets/69453999/896c4138-a26b-457d-a5b6-d4b4285c4b8f) |  ![64_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/2a6a234e-f180-4f02-9ad6-3867c5fd1881) |
> | ![32_mixed](https://github.com/openjdk/jdk/assets/69453999/25a45fa6-f1dd-48e9-85af-702be67f7b9b) | ![32_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/f5d80538-efde-47e4-860f-9b15409085a7) |
> |  ![8_mixed](https://github.com/openjdk/jdk/assets/69453999/112b23c5-f048-4678-8d04-2c682696b357) | ![8_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/0fdb9b6b-242c-424c-83ef-3d05703f5c42) |

Looks good.  Just one suggestion for a possible(?) improvement.

src/hotspot/share/gc/g1/g1RemSet.cpp line 1302:

> 1300: 
> 1301:       uint worker = 0;
> 1302:       while (cur != nullptr) {

Rather than transferring buffers one-by-one (with a somewhat expensive though uncontended push
for each), one could use `buffers._entry_count` and `num_workers` to get the number of buffers to
put in each slot, and peel off that many from the list for each slot.  That's maybe a little bit more
complicated than the current code?  And probably needs a lot of buffers to be noticeable.

-------------

Marked as reviewed by kbarrett (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/18134#pullrequestreview-1920493822
PR Review Comment: https://git.openjdk.org/jdk/pull/18134#discussion_r1514956801