RFR: 8327452: G1: Improve scalability of Merge Log Buffers [v4]

Mon Mar 11 10:31:58 UTC 2024

On Fri, 8 Mar 2024 12:59:04 GMT, Ivan Walulya <iwalulya at openjdk.org> wrote:

>> Hi all,
>> 
>> Please review this change to reduce contention on the DCQS. The parallel work done by the worker threads is very small, therefore contention on the DCQS dominates the Merge Log Buffers phase. In this change, we add a sequential phase to distribute the Log Buffers to the worker threads. This removes the DCQS bottleneck in the highly contended case, at a small cost to the cases with low contention.
>> 
>> Testing Tier 1-3
>> 
>> The graphs below are using Bigramtester at 20gb, "Distribute and Avg Log Buffers" combines the duration of the sequential distribute phase and the avg duration of the parallel `Merge Log Buffers` phase.
>> | Time (ms)  | # Cards |
>> | ------------- | ------------- |
>> | ![128_concurrent_start](https://github.com/openjdk/jdk/assets/69453999/c6663ed9-6b81-4e32-bb1e-17627591406f) | ![128_concurrent_start_cards](https://github.com/openjdk/jdk/assets/69453999/1909564b-2482-4767-a334-9f5612d5cb8c)|
>> | ![128_mixed](https://github.com/openjdk/jdk/assets/69453999/d8dd8933-3aae-4795-b426-b4022bcf1ff2)| ![128_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/ed5b8279-8d70-4dc6-b53d-59bf04b6e12e)|
>> | ![64_concurrent_start](https://github.com/openjdk/jdk/assets/69453999/5fb03cc5-2b86-4b9d-83ff-caa0eb482859) | ![64_concurrent_start_cards](https://github.com/openjdk/jdk/assets/69453999/c62fb05a-dca1-4b3f-895c-62837f7af2f8)|
>> | ![64_mixed](https://github.com/openjdk/jdk/assets/69453999/896c4138-a26b-457d-a5b6-d4b4285c4b8f) |  ![64_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/2a6a234e-f180-4f02-9ad6-3867c5fd1881) |
>> | ![32_mixed](https://github.com/openjdk/jdk/assets/69453999/25a45fa6-f1dd-48e9-85af-702be67f7b9b) | ![32_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/f5d80538-efde-47e4-860f-9b15409085a7) |
>> |  ![8_mixed](https://github.com/openjdk/jdk/assets/69453999/112b23c5-f048-4678-8d04-2c682696b357) | ![8_mixed_cards](https://github.com/openjdk/jdk/assets/69453999/0fdb9b6b-242c-424c-83ef-3d05703f5c42) |
>
> Ivan Walulya has updated the pull request incrementally with one additional commit since the last revision:
> 
>   remove ifdef

Changes requested by kbarrett (Reviewer).

src/hotspot/share/gc/g1/g1RemSet.cpp line 1303:

> 1301: 
> 1302:       BufferNode* head = buffers._head;
> 1303:       BufferNode* tail = head;

Because there may be non-full buffers, num_buffers may be an underestimate of
the actual number of buffers. And buffers_per_thread is rounded down. Because
of that, this calculation may result in various imbalances, which might be
somewhat significant. I think it would be better to instead calculate
(rounding up) and use entries_per_thread, and in the loop increment count by
cur->size() (before updating cur).

-------------

PR Review: https://git.openjdk.org/jdk/pull/18134#pullrequestreview-1927467061
PR Review Comment: https://git.openjdk.org/jdk/pull/18134#discussion_r1519494719