GC interface update

Tue Apr 25 16:30:51 UTC 2017

Hi Erik,

wow that is a lot of stuff coming up :-) It mostly matches what I had in
mind, and even goes much further (still need to digest the gory details
;-) ). Some quick thoughts in random order:

I like that you planned for generators for interpreter, c1, c2 and
graal. I was about to start working on that, but if you're already doing
it, I don't need to.

Regarding Shenandoah, what we need there is a way to apply read- and
write-barriers on *every* heap access. This is used to resolve the
target object to its to-space copy. For example, when reading a field
from an object, and that object is in from-space, we first need to read
its forwarding pointer to arrive at the to-space copy and read from
there. Similarly, for writes, we first need to invoke some write barrier
magic to copy the target object to to-space, CAS the forwarding pointer
of the from-space object, and then do the write into the to-space copy.
In pseudocode, a store (or load) that used to look like this:

store(oop obj, int offset, int value)

now needs to look like this:

obj = write_barrier(obj)
store(obj, offset, value)

With regards to the GC interface, this means we need to have access to
the source (for loads) or target (for stores) object, not only the
actual field address. In fact, a field address would be pointless; what
we need is the object+offset.
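
To make the object+offset point concrete, here is a minimal C++ sketch of
such barriers. It is not Shenandoah code: the Obj layout, the in_from_space
flag, and the use of new as a stand-in for a to-space allocation are all
assumptions made for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>

// Toy object layout (an assumption): forwarding pointer plus int fields.
struct Obj {
    std::atomic<Obj*> fwd;   // points to itself until evacuated
    bool in_from_space;      // hypothetical flag; a real GC checks the region
    int fields[4];
    Obj() : fwd(this), in_from_space(false), fields{0, 0, 0, 0} {}
};

// Read barrier: follow the forwarding pointer to the current copy.
inline Obj* read_barrier(Obj* obj) {
    return obj->fwd.load(std::memory_order_acquire);
}

// Write barrier: make sure a to-space copy exists, then return it.
inline Obj* write_barrier(Obj* obj) {
    Obj* cur = obj->fwd.load(std::memory_order_acquire);
    if (!obj->in_from_space || cur != obj) {
        return cur;  // not in from-space, or already evacuated
    }
    Obj* copy = new Obj();  // stand-in for a to-space allocation
    std::memcpy(copy->fields, obj->fields, sizeof(obj->fields));
    Obj* expected = obj;
    if (obj->fwd.compare_exchange_strong(expected, copy,
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire)) {
        return copy;  // we installed the forwarding pointer
    }
    delete copy;      // another thread won the race
    return expected;  // expected now holds the winner's copy
}

// Accessors take object + index, so the barrier can see the object.
inline int load_int(Obj* obj, std::size_t index) {
    assert(index < 4);
    return read_barrier(obj)->fields[index];
}

inline void store_int(Obj* obj, std::size_t index, int value) {
    assert(index < 4);
    write_barrier(obj)->fields[index] = value;
}
```

With only a raw field address, neither barrier could find the forwarding
pointer, which is why the accessors above take the object and an index.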

Does your proposal provide for this?

Why don't you push all this into the jdk10-sandbox, under the
JDK-8163329-branch (aka GC-interface-branch)? We do need to collaborate
on this stuff, and the best way to do that would be with actual code
exchange. It's easy to do in the sandbox: we can go completely wild in
there until we're satisfied ;-)

To be honest, I wouldn't go over the top to optimize runtime barrier
accesses. I haven't seen a single benchmark yet that suffers from
virtual calls. Shenandoah does introduce *many* more virtual calls (in
its current design), i.e. one for each primitive load and store, and it
doesn't seem to impact performance or show up in profiles on benchmarks
we are running *at all*. It seems like a complete non-issue to me. I
suppose it is possible to construct benchmarks where it does matter
(heavily exercising JNI heap accessors comes to mind), but even then I
don't think virtual calls in the runtime accessors hurt that much. If
you have such a benchmark, please please let me (or us) know. Aleksey is
building a GC benchmark suite, and I'm sure he'd be happy to include
such a benchmark [1].

[1] http://icedtea.classpath.org/people/shade/gc-bench/

My idea for runtime accessors basically boiled down to the API that's
currently in oop.hpp / oop.inline.hpp: i.e. forward all heap access
through the barrier via 1 (and only 1) virtual call. I don't exactly
mind some magic to avoid even this one virtual call, but I question if
it's worth the additional complexity (which doesn't sound exactly
negligible).
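
As a rough illustration of the "1 (and only 1) virtual call" idea, here is
a small C++ sketch. The class and function names (BarrierSet,
ForwardingBarrierSet, int_field, int_field_put) are toy stand-ins chosen
for this example and are not the actual declarations in oop.hpp.

```cpp
#include <cassert>
#include <cstddef>

// Toy object (an assumption): forwarding pointer plus int fields.
struct Obj {
    Obj* fwd;
    int fields[4];
};

// Base barrier set: accesses go straight to the object.
class BarrierSet {
public:
    virtual ~BarrierSet() {}
    // The single virtual call per access: pick the copy to operate on.
    virtual Obj* resolve_for_read(Obj* obj) { return obj; }
    virtual Obj* resolve_for_write(Obj* obj) { return obj; }
};

// A GC that forwards every access to the object's current copy.
class ForwardingBarrierSet : public BarrierSet {
public:
    Obj* resolve_for_read(Obj* obj) override { return obj->fwd; }
    Obj* resolve_for_write(Obj* obj) override { return obj->fwd; }
};

static BarrierSet default_barrier_set;
static BarrierSet* g_barrier_set = &default_barrier_set;

// Runtime accessors: exactly one virtual call, then a plain access.
inline int int_field(Obj* obj, std::size_t index) {
    assert(index < 4);
    return g_barrier_set->resolve_for_read(obj)->fields[index];
}

inline void int_field_put(Obj* obj, std::size_t index, int value) {
    assert(index < 4);
    g_barrier_set->resolve_for_write(obj)->fields[index] = value;
}
```

The shared code only ever calls int_field / int_field_put; which barrier
runs is decided by whichever BarrierSet is installed.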

We can help with the arm64 port. :-)

Those comments are purely based on your description. Now I'm going to
study your patch :-)

Again, I think it would be great to simply push this to the sandbox.
Also, I'm hanging out on IRC #openjdk on OFTC as rkennke.

Cheers,
Roman

On 25.04.2017 at 17:35, Erik Österlund wrote:
> Hi everyone,
> 
> I'm glad to see that we all want to go towards modularizing the GC
> implementations in hotspot more. Thank you Roman for starting this
> thread. I have wanted a better GC interface since I first set foot in
> hotspot.
> 
> As mentioned, I have been cooking up a GC barrier interface prototype
> based on ideas mentioned earlier in this thread. I will provide a
> preview of where it is headed in this email before we start to diverge
> too much.
> 
> I have a long patch queue with many individual changes, but to get an
> overview for the discussion here, I will start by posting a combined
> webrev for preview of the whole thing as a pre-review. Once the real
> detailed review starts, I will start sending out the smaller incremental
> changes that are easier to grasp in reviews.
> 
> The webrev is based on the latest jdk10-hs repo.
> 
> Full webrev: http://cr.openjdk.java.net/~eosterlund/gc_interface/webrev.00/
> 
> == High Level Design Philosophy ==
> 
> The overall design idea has been to remove all explicit calls to barrier
> code in the VM. The barrier code is often required for conforming to
> some kind of semantics. So rather than having these explicit barrier
> calls in the shared code, the new API is instead to perform a memory
> access that specifies the intended semantic properties instead. These
> semantic properties are not only GC related but could be any property
> that is important for performing a memory access with the right
> semantics. These semantic properties are called decorators in my API.
> 
> So for example, a decorator for a load on an oop could be ACCESS_ON_HEAP
> to denote that this access is performed in the Java heap, and MO_ACQUIRE
> that the access should have acquire memory ordering semantics and
> ACCESS_ON_WEAK to denote that the access is performed on a weakly
> reachable reference. This would in the end need to boil down to the
> following barriers:
> 1) Compressed oops to decode narrow oops
> 2) Potentially an acquire membar on e.g. ARM machines
> 3) Potentially a SATB enqueue barrier for SATB-type GCs
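
As a rough illustration of the decorator idea (a sketch, not code from the
webrev): decorators can be modelled as bit flags in a template parameter,
so the access is specialized at compile time. Only the three decorator
names come from the description above; their values, HasDecorator,
oop_load, and the satb_queue vector are assumptions, and compressed-oop
decoding is left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint64_t DecoratorSet;
constexpr DecoratorSet ACCESS_ON_HEAP = 1u << 0;  // bit values are made up
constexpr DecoratorSet MO_ACQUIRE     = 1u << 1;
constexpr DecoratorSet ACCESS_ON_WEAK = 1u << 2;

// Compile-time test for a decorator in a set.
template <DecoratorSet decorators, DecoratorSet flag>
struct HasDecorator {
    static constexpr bool value = (decorators & flag) != 0;
};

struct Obj { int id; };

// Hypothetical stand-in for a SATB mark queue.
static std::vector<Obj*> satb_queue;

// The caller states the semantics; the barriers follow from them.
template <DecoratorSet decorators>
inline Obj* oop_load(std::atomic<Obj*>* addr) {
    assert(addr != nullptr);
    Obj* value = HasDecorator<decorators, MO_ACQUIRE>::value
        ? addr->load(std::memory_order_acquire)
        : addr->load(std::memory_order_relaxed);
    // A SATB-type GC must keep a weakly reachable referent alive once
    // it has been handed out, so enqueue it.
    if (HasDecorator<decorators, ACCESS_ON_WEAK>::value && value != nullptr) {
        satb_queue.push_back(value);
    }
    return value;
}
```

A call site would then read, for example,
oop_load<ACCESS_ON_HEAP | MO_ACQUIRE | ACCESS_ON_WEAK>(&slot), without
mentioning any GC by name.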
> 
> So rather than treating these differently and having runtime-resolved
> explicit barriers built bottom up, the approach is to build accesses top
> down instead. The GC may override the whole access to do anything, but
> will probably want to reuse things like decoding and encoding compressed
> oops, and the pre/post write pattern and memory ordering. Therefore
> barrier sets may ask their super class to fill in such details. This
> allows an arbitrary level of control without introducing code duplication.
> 
> Each BarrierSet has 4 barrier-related components with a similar design.
> 
> 1) An AccessBarrier class responsible for performing accesses requested
> by the runtime system through a new Access API (more later)
> 2) A BarrierSetCodeGen class responsible for generating accesses in
> platform-specific assembly code (stub generators and interpreter)
> 3) A C1BarrierSetCodeGen class responsible for generating accesses for
> the C1 compiler
> 4) A C2BarrierSetCodeGen class responsible for generating accesses for
> the C2 compiler
> 
> So there is one class for each part of hotspot (runtime,
> platform-specific, c1, c2), and they all follow a class hierarchy
> mirroring their respective BarrierSet hierarchy to reuse more general
> functionality like memory ordering and compressed oops.
> 
> == Runtime: Access API ==
> 
> The runtime part of the API goes through a new class called Access. All
> decorated accesses should go through this interface. It makes heavy use
> of templates to perform the right accesses and barriers in the most
> optimal way, by connecting the intended Access semantics to the
> appropriately decorated AccessBarrier of the current BarrierSet. It
> combines different decorators in a pipeline that are resolved at
> different times in the JVM life cycle, but handled in the same way. Some
> decorators are resolved at build-time, like for example whether the
> build needs to support barriers on primitives. If Shenandoah is built
> for example, this decorator will be set, and if Shenandoah is not built,
> it will not be set. Other decorators are resolved statically at the call
> site, such as what strength a reference has. Yet some decorators are
> resolved at runtime, such as whether compressed oops are used or not and
> which garbage collector was selected.
> 
> When there exist runtime dependencies for resolving a barrier, the
> Access system will generate function pointers for the access. The
> function pointers initially point to a resolver function that checks the
> selected runtime properties, and then patches the function pointer to
> point to a statically generated function with those properties set, so
> that the next time the function is called, it will call straight into
> the appropriate barrier. This means that where we would previously have
> multiple virtual calls for pre- and post-write barriers, followed by if
> checks for compressed oops, all of that boils down to a single function
> pointer call that then has inlined everything that needs to be done for
> that set of runtime parameters.
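
The self-patching function pointer described above can be sketched in a
few lines of C++. This is only an illustration under assumptions: the
use_forwarding flag stands in for any runtime-selected property, all names
are hypothetical, and a real implementation would have to patch the
pointer in a thread-safe way.

```cpp
#include <cassert>

// Toy object (an assumption): forwarding pointer plus one field.
struct Obj {
    Obj* fwd;
    int field;
};

// Hypothetical runtime property, known only after VM startup.
static bool use_forwarding = false;
static int resolver_calls = 0;  // only here to observe the patching

// Statically generated variants, one per runtime configuration.
template <bool forwarding>
static int load_variant(Obj* obj) {
    return forwarding ? obj->fwd->field : obj->field;
}

typedef int (*LoadFn)(Obj*);

static int load_resolver(Obj* obj);

// The dispatch point initially targets the resolver.
static LoadFn load_fn = &load_resolver;

// First call: inspect runtime state, patch the pointer, then forward.
static int load_resolver(Obj* obj) {
    ++resolver_calls;
    if (use_forwarding) {
        load_fn = &load_variant<true>;
    } else {
        load_fn = &load_variant<false>;
    }
    return load_fn(obj);
}

// What shared code calls: always a single indirect call.
inline int load_field(Obj* obj) {
    assert(obj != nullptr);
    return load_fn(obj);
}
```

After the first call, load_fn points directly at the chosen variant, so
later calls skip the resolver entirely.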
> 
> The goal has been to separate out GC-specific code to GC-specific
> directories as far as barriers are concerned. To glue this together,
> there is a barrierSetConfig.hpp and barrierSetConfig.inline.hpp. The
> barrierSetConfig.hpp configures what barrier sets there are and produces
> a macro allowing you to do something for each barrier set. This is used
> by barrier resolution at runtime. The barrierSetConfig.inline.hpp
> basically just includes the GC-specific inline headers to allow
> inlining the GC barriers all the way. So anyone making a new GC should
> put their GC in there. I added a Shenandoah GC there as an example so
> you can see what I mean.
> 
> The Access API goes through a template pipeline. First the Access class
> bridges the API to functions in the AccessInternal namespace. This
> involves using temporary proxy objects to artificially infer the return
> types of loads. Then in the AccessInternal namespace the types are first
> decayed, meaning that CV-qualifiers and references are stripped. Then
> types of addresses and values are joined, at which time certain
> decorators are inferred, like the use of compressed oops when e.g. loading
> an oop from a narrowOop*. Other implicit decorators are also inferred
> then, such as a default memory ordering if none is specified, and other
> rules related to memory ordering such as sequentially consistent stores
> implicitly also being releasing stores etc. Then build-time decorators
> are added and a pre-runtime stage is reached where the mechanism tries
> to bind accesses statically if possible, and otherwise produces a
> runtime-dispatch point that statically generates all possible runtime
> variants of the access and a self-patching function pointer that
> resolves the correct variant at runtime. These statically generated
> barriers are resolved through the BarrierSet AccessBarrier that gives
> the GC full control for generating an appropriate access. It can use the
> DecoratorTest class to check for different decorators specifying
> semantics that add barriers altering the access. Eventually, a super
> class of the AccessBarrier called BasicAccessBarrier handles
> compressed oops and calls RawAccessBarrier, which inspects the decayed
> types and forwards to appropriate calls to Atomic or OrderAccess, or
> performs volatile or raw accesses, depending on the selected memory
> ordering.
> 
> I have then applied the Access API to many weird accesses performed in
> the runtime system where we check if we are using G1 and then
> subsequently doing some weird ad-hoc SATB enqueue barrier. Examples
> include the string table, ciMetadata and jvmtiTagMap, unsafe get,
> reference get, jweak resolve etc. These accesses now use decorated
> accesses through Access instead.
> 
> == C1 ==
> 
> The shared C1 barrier code has been moved into the C1BarrierSetCodeGen
> class for each specific barrier set. It generates decorated accesses,
> and decorates them with GC barriers as required by the specified
> semantics. The slowpath stubs have been refactored. The code stubs have
> moved into the C1BarrierSetCodeGen class and it assembles the machine
> code with the platform specific BarrierSetCodeGen assembler. The
> runtime1 code stubs have been changed to not be generated in switch
> statements, but instead with a code generation closure that calls into
> the C1BarrierSetCodeGen that assembles the runtime1 stub with the
> platform specific BarrierSetCodeGen.
> 
> The design of accesses going through C1BarrierSetCodeGen is consistent
> with the rest of the Access API: the accesses are built top down and
> allow overriding the whole operation. The C1BarrierSetCodeGen class
> mirrors the class hierarchy of the BarrierSets.
> 
> == C2 ==
> 
> Similar to the C1BarrierSetCodeGen, the C2BarrierSetCodeGen helps the
> GraphKit generate decorated accesses top-down. The class hierarchy of
> the C2BarrierSetCodeGen class mirrors the class hierarchy of the
> BarrierSets. Since C2 expands GC barriers rather early and then pulls
> the barriers through optimizations, there are some additional calls to
> be able to distinguish barrier-related nodes from non-barrier nodes.
> 
> == Graal ==
> 
> For now I only try not to break the Graal port used for AoT in the
> hotspot repository. Ideally, graal would follow the same pattern, but
> initially this is out of scope for me.
> 
> == GC: BarrierSet consolidation ==
> 
> The hierarchy of our barrier sets seems unnecessarily deep - partially
> because the card table itself is part of the card table barrier sets. I
> have split the card table hierarchy and separated it from the barrier
> set hierarchy. A CardTableModRefBarrier *has* a CardTable. As a result
> the hierarchy could be simplified a lot to contain only BarrierSet,
> ModRefBarrierSet, CardTableBarrierSet and G1BarrierSet. G1BarrierSet and
> CardTableBarrierSet are the only leaves, and ModRefBarrierSet is only a
> small helper class.
> 
> == Collaboration ==
> 
> I hope you like the direction this is going and hope it will suit
> Shenandoah as well. I have not applied the Access API for all
> primitives yet because I thought that you probably have a better idea
> where they are since your GC uses such barriers a lot more. But the
> framework should be able to support that without much trouble. So I hope
> we can work together a bit on this. If there are any shortcomings, I
> hope we can work it out together.
> 
> Also, as you can see, I have only provided x86 and SPARC ports so far.
> The architecture specific code mostly involves the stub generators, the
> interpreter, and the G1 C1 slow path stuff. I was hoping to eventually
> get some help from other port maintainers to port this to their
> respective platforms. If you feel compelled to help porting this to ARM,
> I would be very happy. ;)
> 
> And perhaps somebody would like to help getting PPC and S390 on board
> too. I thought I would at least start the discussion now.
> 
> So yeah, hope everyone likes this direction. If there are any questions,
> I will happily answer them. Any feedback is very welcome.
> 
> Thanks,
> /Erik
> 
> On 2017-04-25 14:05, Per Liden wrote:
>> On 2017-04-24 15:46, Roman Kennke wrote:
>>> On 24.04.2017 at 08:37, Per Liden wrote:
>>>> On 04/20/2017 02:29 PM, Roman Kennke wrote:
>>>>> On 20.04.2017 at 14:01, Per Liden wrote:
>>>>>> On 2017-04-20 12:05, Aleksey Shipilev wrote:
>>>>>>> On 04/20/2017 09:37 AM, Kirk Pepperdine wrote:
>>>>>>>>> Good stuff. However, one thing I'm not quite comfortable with
>>>>>>>>> is the
>>>>>>>>> introduction of the GC class (and its sub classes). I don't quite
>>>>>>>>> see the
>>>>>>>>> purpose of this interface split-up between GC and CollectedHeap. I
>>>>>>>>> view
>>>>>>>>> CollectedHeap as _the_ interface (but yes, it needs some love),
>>>>>>>>> and
>>>>>>>>> as a
>>>>>>>>> result I think the functions you've exposed in the GC class
>>>>>>>>> actually
>>>>>>>>> belong in CollectedHeap.
>>>>>>>>
>>>>>>>> I thought the name CollectedHeap implied the state of the heap
>>>>>>>> after the
>>>>>>>> collector has completed. What is the intent of CollectedHeap?
>>>>>>>
>>>>>>> No, CollectedHeap is the actual current GC interface. This is the
>>>>>>> entry point to
>>>>>>> GC as far as the rest of runtime is concerned, see e.g.
>>>>>>> CollectedHeap*
>>>>>>> Universe::create_heap(), etc. Implementing CollectedHeap,
>>>>>>> CollectorPolicy, and
>>>>>>> BarrierSet are the bare minimum required for GC implementation
>>>>>>> today. [1]
>>>>>>
>>>>>> Yep, and I'd like us to move towards tightening down the GC
>>>>>> interface to
>>>>>> basically be cleaned up versions of CollectedHeap and BarrierSet.
>>>>>>
>>>>>> CollectorPolicy and some other things that class drags along, like
>>>>>> AdaptiveSizePolicy, are way too collector specific and I don't think
>>>>>> that should be exposed to the rest of the VM.
>>>>>
>>>>> Right, I totally agree with this.
>>>>>
>>>>> BTW, another reason for making a new GC interface class instead of
>>>>> further bloating CollectedHeap as the central interface was that there
>>>>> is way too much implementation stuff in CollectedHeap. Ideally, I'd
>>>>> like
>>>>> to have a true interface with no or only trivial implementations
>>>>> for the
>>>>> declared methods, and most importantly nothing that's only ever needed
>>>>> by the GC itself (and never called by the runtime). But as I said, I'm
>>>>> not against a serious refactoring and tightening-up of CollectedHeap
>>>>> instead.
>>>>
>>>> Yes, I'd like to keep CollectedHeap as the main interface, but I
>>>> completely agree that CollectedHeap currently contains too much
>>>> implementation stuff that we probably want to move out.
>>>
>>> Ok, I will revert that part of the change to use CollectedHeap as main
>>> interface then. It's no big deal, so far I only had one additional
>>> method for serviceability support in the GC interface class anyway.
>>
>> Ok, sounds good.
>>
>> And regarding BarrierSet. As you know, Erik Österlund is working on
>> overhauling BarrierSet and how barriers are used across the VM. He'll
>> be sending out his current proposal later today.
>>
>>>
>>> Would you also prefer to keep 'management' of the heap in Universe too?
>>> I.e. Universe::create_heap() and Universe::heap() etc? Or do you see a
>>> benefit in moving it out like I did with gc_factory.cpp? The idea being
>>> that there's only one smallish place that knows about all the existing
>>> GC impls?
>>
>> I'd like to keep Universe::heap() and create_heap(), but I'd like to
>> move away from our current if-else if-else.. and instead have a more
>> declarative way of saying which GCs are available. create_heap()
>> would then just walk the list of available GCs and ask each if it's enabled
>> and if so create an instance. I think we'd want to do something
>> similar to (or even combine this with) what Erik is doing in his
>> BarrierSet patch.
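
One possible shape for such a declarative list, sketched in C++ under
assumptions: Heap, SerialHeap, G1Heap, GCDescriptor, and the use_serial /
use_g1 flags are toy stand-ins for this example, not the HotSpot classes
or VM flags.

```cpp
#include <cassert>
#include <string>

// Toy heap interface (an assumption, not HotSpot's CollectedHeap).
struct Heap {
    virtual ~Heap() {}
    virtual const char* name() const = 0;
};

struct SerialHeap : Heap {
    const char* name() const override { return "serial"; }
};

struct G1Heap : Heap {
    const char* name() const override { return "g1"; }
};

// Hypothetical stand-ins for the VM flags that select a collector.
static bool use_serial = false;
static bool use_g1 = false;

// One table entry per available GC: how to ask it and how to build it.
struct GCDescriptor {
    const char* name;
    bool (*is_enabled)();
    Heap* (*create)();
};

static const GCDescriptor available_gcs[] = {
    { "serial", []() -> bool { return use_serial; },
                []() -> Heap* { return new SerialHeap(); } },
    { "g1",     []() -> bool { return use_g1; },
                []() -> Heap* { return new G1Heap(); } },
};

// Walk the list; the first enabled GC creates the heap.
inline Heap* create_heap() {
    for (const GCDescriptor& gc : available_gcs) {
        if (gc.is_enabled()) {
            return gc.create();
        }
    }
    return nullptr;  // no collector selected
}
```

Adding a collector then means adding one table entry rather than another
else-if branch.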
>>
>> In general, to make it easier to review/test/integrate all these
>> changes it would be good if we can have incremental patches, each
>> addressing some specific/contained area.
>>
>> cheers,
>> Per
> 
