RFR: 8323791: Parallel: Using non-atomic for live bytes update during Full GC [v4]

Tue Jan 30 16:17:49 UTC 2024

On Mon, 29 Jan 2024 15:19:47 GMT, Albert Mingkun Yang <ayang at openjdk.org> wrote:

>> Simple optimization to avoid atomic-operation during Full GC marking.
>> 
>> Test: Full GC pause length shows ~1% regresion when using single gc-thread and ~20% improvement when using multiple gc-threads using the attached program.
>
> Albert Mingkun Yang has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains one additional commit since the last revision:
> 
>   pgc-fullgc-livebyte

I can see why this change could improve performance, but did you also try and measure full gc (and total gc) times with other applications than the given test?

In my testing (with lots of threads), the change improves performance on that reproducer with the given (imo _very_ uncommon) options even more than the 20%, so that is good.

However I have seen slight regressions (or at least no difference) with e.g. gcbasher 128m (with `-XX:MarkSweepDeadRatio=0 -XX:-ScavengeBeforeFullGC -Xms128m -Xmx128m` both with default 2 threads and up to 256 threads, but also just with `-Xmx/-Xms`).
Obviously not using a very tightly controlled application like the test adds variance, but the performance seems very inconclusive here.

The other question I have is whether you considered other approaches in mitigating the overhead (e.g. like G1 using a per-region cache)?
Reason why I'm asking is that adding a parallel phase with default number of threads is something somewhat scary to me due to spin-up costs and something that tends to blow up in some cases. (People in practice are running tiny applications with tons of threads as they tend to not optimize their startup scripts for small tools).

src/hotspot/share/gc/parallel/psParallelCompact.cpp line 1586:

> 1584: 
> 1585: class AggregateTask : public WorkerTask {
> 1586: private:

Suggestion:

class PSAggregateLiveWordsTask : public WorkerTask {

Suggestion:

class AggregateTask : public WorkerTask {

(superfluous newline, "private" specifier)

Although not done consistently in the file, I would prefer to make it more clear that this class belongs to parallel scavenge. Maybe an option is to forward-declare that class as inner class to `PSParallelCompact`.

Also, "Aggregate"Task seems very unspecific, and making it more specific would make the code more self-documenting. See my suggestion.

-------------

Changes requested by tschatzl (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/17442#pullrequestreview-1850893965
PR Review Comment: https://git.openjdk.org/jdk/pull/17442#discussion_r1471017068