RFR: 8332455: Improve G1/ParallelGC tasks to not override array lengths

Fri Jul 5 13:24:22 UTC 2024

On Fri, 17 May 2024 12:02:08 GMT, Roman Kennke <rkennke at openjdk.org> wrote:

> In order to not cause excessive traffic on task queues when scanning large object arrays, G1 and ParallelGC use a way of slicing those arrays into smaller pieces. It overrides the from-space array's length field to track the array slices.
> 
> I think it would be cleaner if we improve the tasks such that array slices can be fully encoded in the task and does not require overriding the array length.
> 
> This PR borrows the principal encoding and slicing algorithm from Shenandoah (originally written by @shipilev). It also unifies the slicing implementations of the young GC and concurrent marking GC and ParallelGC.
> 
> For a description of the encoding and slicing algorithm, see top of arraySlicer.hpp.
> 
> On x86 (32-bit) we don't have enough bits in the single-word task to encode the slicing, so I'm extending the task to 64 bits (pointer and two int32 fields).
> 
> I put in some efforts to make sure the shared array slicing uses the user-configurable flags ParGCArrayScanChunk and ObjArrayMarkingStride just as before, but TBH, I don't see the point of having those flags as product flags to begin with. I would probably deprecate and remove ParGCArrayScanChunk, and use the develop flag ObjArrayMarkingStride everywhere. YMMV.
> 
> Testing:
>  - [x] hotspot_gc
>  - [x] tier1
>  - [x] tier2

Besides the problems discussed below, I also wondered about the rationale for
this change. The JBS issue (incorrectly, see below) describes what G1
currently does, but doesn't say anything about why it's a problem. I see this
issue has the lilliput label, so maybe it has something to do with that
project? Without that information it's hard to discuss any alternatives.

The bug report says G1 array slicing uses the from-space array's length field
to track the array slices.  That isn't true.  It uses the to-space array's
length field, since it uses PartialArrayTaskStepper.  ParallelGC uses the
from-space length in PSPromotionManager::process_array_chunk, though there is
an RFE to change ParallelGC to also use PartialArrayTaskStepper (JDK-8311163).

So why is that a problem?  The JBS issue is silent about that.  Some things
that came up in some internal to Oracle discussion include

(1) That approach has problems for the concurrent copying collectors.
(2) That approach doesn't work for G1 concurrent marking.
(3) For STW copying collectors, an allocation failure leaves one without a
separate copy, preventing use of that mechanism.

Having a unified mechanism that works for all collectors probably has some
benefits, if it doesn't impose some onorous cost on the STW copying
collectors. The allocation failure case should be rare, but it's still an
annoying special case to have to handle.

So yes, there are reasons to investigate alternatives.  But I think what is
being proposed here is busted by invalid assumptions about the address space.

I also wish the proposed change was broken up a little bit.  Changing all of
G1 STW, G1 Concurrent, and Parallel at the same time is hard to review, and I
don't think there's a need to do all in one change.

I have a couple of ideas for alternative mechanisms that I might explore.

src/hotspot/share/gc/shared/arraySlicer.hpp line 74:

> 72: //    10 bits for slice: max 1024 blocks per array
> 73: //     5 bits for power: max 2^32 array
> 74: //    49 bits for   oop: max 512 TB of addressable space

This encoding is incompatible with Linux Large Virtual Address space:
https://www.kernel.org/doc/html/v5.8/arm64/memory.html
which has a 52 bit address space.  I also  don't know of any reason why future address space
configuration  couldn't support the full non-tagged range (so 56 bits). I think that makes this
scheme not viable.

src/hotspot/share/gc/shared/arraySlicer.hpp line 206:

> 204:   void*    _ptr;
> 205:   uint16_t _slice;
> 206:   uint16_t _pow;

This implementation imposes significant additional overhead on the normal non-array case, since
an oop task needs to carry around the unused _slice and _pow members, including insertion and
removal from the taskqueue and such.  That seems pretty problematic for 32bit platforms that are
targeting relatively low end embedded space.

-------------

Changes requested by kbarrett (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/19282#pullrequestreview-2160305401
PR Review Comment: https://git.openjdk.org/jdk/pull/19282#discussion_r1666555197
PR Review Comment: https://git.openjdk.org/jdk/pull/19282#discussion_r1666705919