GC interface update

Tue Apr 25 15:35:57 UTC 2017

Hi everyone,

I'm glad to see that we all want to go towards modularizing the GC 
implementations in hotspot more. Thank you Roman for starting this 
thread. I have wanted a better GC interface since I first set foot in 
hotspot.

As mentioned, I have been cooking up a GC barrier interface prototype 
based on ideas mentioned earlier in this thread. I will provide a 
preview of where it is headed in this email before we start to diverge 
too much.

I have a long patch queue with many individual changes, but to get an 
overview for the discussion here, I will start by posting a combined 
webrev for preview of the whole thing as a pre-review. Once the real 
detailed review starts, I will start sending out the smaller incremental 
changes that are easier to grasp in reviews.

The webrev is based on the latest jdk10-hs repo.

Full webrev: http://cr.openjdk.java.net/~eosterlund/gc_interface/webrev.00/

== High Level Design Philosophy ==

The overall design idea has been to remove all explicit calls to barrier 
code in the VM. The barrier code is often required for conforming to 
some kind of semantics. So rather than having these explicit barrier 
calls in the shared code, the new API is instead to perform a memory 
access that specifies the intended semantic properties instead. These 
semantic properties are not only GC related but could be any property 
that is important for performing a memory access with the right 
semantics. These semantic properties are called decorators in my API.

So for example, the decorators for a load of an oop could be 
ACCESS_ON_HEAP to denote that the access is performed in the Java heap, 
MO_ACQUIRE to denote that the access should have acquire memory ordering 
semantics, and ACCESS_ON_WEAK to denote that the access is performed on 
a weakly reachable reference. In the end, this would need to boil down 
to the following barriers:
1) Compressed oops to decode narrow oops
2) Potentially an acquire membar on e.g. ARM machines
3) Potentially a SATB enqueue barrier for SATB-type GCs
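
As a rough illustration of how such a decorated load could map onto 
those three barriers, here is a minimal C++ sketch. It is not code from 
the webrev: the decorator names come from the text above, while the bit 
values, DecoratorSet, BarrierPlan and plan_oop_load are assumptions made 
up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decorator bits. The names come from the text above; the
// values and the DecoratorSet typedef are assumptions made for this sketch.
typedef uint64_t DecoratorSet;
constexpr DecoratorSet ACCESS_ON_HEAP = 1u << 0;
constexpr DecoratorSet ACCESS_ON_WEAK = 1u << 1;
constexpr DecoratorSet MO_ACQUIRE     = 1u << 2;

// Which low-level barriers a decorated oop load ends up needing.
struct BarrierPlan {
  bool decode_narrow_oop;  // 1) compressed oops decoding
  bool acquire_ordering;   // 2) acquire membar where the platform needs one
  bool satb_enqueue;       // 3) SATB enqueue for SATB-type GCs
};

// Decorators are a template parameter (known at the call site); the two
// bool arguments stand in for properties that are only known at runtime.
template <DecoratorSet decorators>
BarrierPlan plan_oop_load(bool use_compressed_oops, bool gc_uses_satb) {
  BarrierPlan plan;
  plan.decode_narrow_oop = (decorators & ACCESS_ON_HEAP) != 0 && use_compressed_oops;
  plan.acquire_ordering  = (decorators & MO_ACQUIRE) != 0;
  plan.satb_enqueue      = (decorators & ACCESS_ON_WEAK) != 0 && gc_uses_satb;
  return plan;
}

// Usage: a weak, acquiring heap load under compressed oops and an SATB GC.
inline void example_weak_load() {
  BarrierPlan plan =
      plan_oop_load<ACCESS_ON_HEAP | MO_ACQUIRE | ACCESS_ON_WEAK>(true, true);
  assert(plan.decode_narrow_oop && plan.acquire_ordering && plan.satb_enqueue);
}
```

The caller only states the semantics; which barriers actually fire 
depends on the platform, the compressed oops setting and the selected GC.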

So rather than treating these differently and having runtime-resolved 
explicit barriers built bottom up, the approach is to build accesses top 
down instead. The GC may override the whole access to do anything, but 
will probably want to reuse things like decoding and encoding compressed 
oops, and the pre/post write pattern and memory ordering. Therefore 
barrier sets may ask their super class to fill in such details. This 
allows an arbitrary level of control without introducing code duplication.
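
The "override the whole access, but let the super class fill in the 
details" idea might look roughly like the following C++ sketch. The 
class names with a Sketch suffix, the heap base, the shift and the 
vector-based SATB queue are all hypothetical stand-ins, not the classes 
in the webrev.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef uint32_t  narrowOop;  // stand-in for a compressed reference
typedef uintptr_t oop;        // stand-in for an uncompressed reference

// General layer: only knows how to decode compressed oops.
struct BasicAccessBarrierSketch {
  static oop decode(narrowOop v) {
    const uintptr_t heap_base = 0x10000;  // made-up base and shift
    const int shift = 3;
    return v == 0 ? 0 : heap_base + (static_cast<uintptr_t>(v) << shift);
  }
  static oop oop_load(const narrowOop* addr) {
    assert(addr != nullptr);
    return decode(*addr);
  }
};

// GC layer: overrides the whole load, but asks the super class to fill in
// the decoding, then adds its own SATB-style enqueue on top.
struct SATBAccessBarrierSketch : BasicAccessBarrierSketch {
  static std::vector<oop>& satb_queue() {
    static std::vector<oop> queue;
    return queue;
  }
  static oop oop_load(const narrowOop* addr) {
    oop value = BasicAccessBarrierSketch::oop_load(addr);
    if (value != 0) {
      satb_queue().push_back(value);
    }
    return value;
  }
};
```

The decoding logic exists once, in the base layer, and the GC layer 
decides what happens around it.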

Each BarrierSet has 4 barrier-related components with a similar design.

1) An AccessBarrier class responsible for performing accesses requested 
by the runtime system through a new Access API (more later)
2) A BarrierSetCodeGen class responsible for generating accesses in 
platform-specific assembly code (stub generators and interpreter)
3) A C1BarrierSetCodeGen class responsible for generating accesses for 
the C1 compiler
4) A C2BarrierSetCodeGen class responsible for generating accesses for 
the C2 compiler

So there is one class for each part of hotspot (runtime, 
platform-specific, c1, c2), and they all follow a class hierarchy 
mirroring their respective BarrierSet hierarchy to reuse more general 
functionality like memory ordering and compressed oops.

== Runtime: Access API ==

The runtime part of the API goes through a new class called Access. All 
decorated accesses should go through this interface. It makes heavy use 
of templates to perform the right accesses and barriers in the most 
optimal way, by connecting the intended Access semantics to the 
appropriately decorated AccessBarrier of the current BarrierSet. It 
combines different decorators in a pipeline that are resolved at 
different times in the JVM life cycle, but handled in the same way. Some 
decorators are resolved at build-time, like for example whether the 
build needs to support barriers on primitives. If Shenandoah is built 
for example, this decorator will be set, and if Shenandoah is not built, 
it will not be set. Other decorators are resolved statically at the call 
site, such as what strength a reference has. Yet some decorators are 
resolved at runtime, such as whether compressed oops are used or not and 
which garbage collector was selected.

When there exist runtime dependencies for resolving a barrier, the 
Access system will generate function pointers for the access. The 
function pointers initially point to a resolver function that checks the 
selected runtime properties, and then patches the function pointer to 
point to a statically generated function with those properties set, so 
that the next time the function is called, it will call straight into 
the appropriate barrier. This means that where we would previously have 
multiple virtual calls for pre- and post-write barriers, followed by if 
checks for compressed oops, all of that boils down to a single function 
pointer call that then has inlined everything that needs to be done for 
that set of runtime parameters.
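
A minimal sketch of such a self-patching function pointer, assuming a 
single runtime property (compressed oops on or off). The names 
UseCompressedOopsSketch, load_impl, load_resolver and resolve_count are 
hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t  narrowOop;  // stand-in for a compressed reference
typedef uintptr_t oop;        // stand-in for an uncompressed reference

// Stand-in for a property that is only known once the VM has started.
bool UseCompressedOopsSketch = true;
int  resolve_count = 0;  // counts resolver invocations, for illustration

typedef oop (*LoadFn)(const void* addr);

// Statically generated variants, one per combination of runtime properties.
template <bool compressed>
oop load_impl(const void* addr) {
  if (compressed) {
    return static_cast<oop>(*static_cast<const narrowOop*>(addr)) << 3;
  }
  return *static_cast<const oop*>(addr);
}

oop load_resolver(const void* addr);

// The function pointer initially points at the resolver.
LoadFn load_fn = &load_resolver;

// First call: check the runtime properties, patch the pointer, then
// forward to the chosen variant.
oop load_resolver(const void* addr) {
  ++resolve_count;
  load_fn = UseCompressedOopsSketch ? &load_impl<true> : &load_impl<false>;
  return load_fn(addr);
}

// What shared code calls: always a single indirect call.
inline oop load(const void* addr) {
  assert(addr != nullptr);
  return load_fn(addr);
}
```

After the first call the resolver is out of the picture, so every later 
call goes straight to the variant with the runtime checks folded away.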

The goal has been to separate out GC-specific code to GC-specific 
directories as far as barriers are concerned. To glue this together, 
there is a barrierSetConfig.hpp and barrierSetConfig.inline.hpp. The 
barrierSetConfig.hpp configures what barrier sets there are and produces 
a macro allowing you to do something for each barrier set. This is used 
by barrier resolution at runtime. The barrierSetConfig.inline.hpp 
basically just includes the GC-specific inline headers to allow 
inlining the GC barriers all the way. So anyone making a new GC should 
put their GC in there. I added a Shenandoah GC there as an example so 
you can see what I mean.
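
A "do something for each barrier set" macro of this kind is commonly 
written as an X-macro. The sketch below assumes that shape; the macro 
name FOR_EACH_BARRIER_SET_SKETCH, the Kind enumerators and the helper 
functions are invented for illustration and are not the contents of 
barrierSetConfig.hpp.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical list macro: one entry per barrier set built into the VM.
#define FOR_EACH_BARRIER_SET_SKETCH(f) \
  f(CardTable)                         \
  f(G1)                                \
  f(Shenandoah)

// Expands to: CardTableKind, G1Kind, ShenandoahKind, NumBarrierSetKinds
enum BarrierSetKind {
#define DECLARE_KIND(bs) bs##Kind,
  FOR_EACH_BARRIER_SET_SKETCH(DECLARE_KIND)
#undef DECLARE_KIND
  NumBarrierSetKinds
};

// Runtime resolution: try each configured barrier set in turn.
inline BarrierSetKind barrier_set_from_name(const char* name) {
  assert(name != nullptr);
#define MATCH_NAME(bs) if (std::strcmp(name, #bs) == 0) return bs##Kind;
  FOR_EACH_BARRIER_SET_SKETCH(MATCH_NAME)
#undef MATCH_NAME
  return NumBarrierSetKinds;  // not a configured barrier set
}

inline const char* barrier_set_name(BarrierSetKind kind) {
  switch (kind) {
#define NAME_CASE(bs) case bs##Kind: return #bs;
    FOR_EACH_BARRIER_SET_SKETCH(NAME_CASE)
#undef NAME_CASE
    default: return "unknown";
  }
}
```

With this shape, adding a new GC means adding one line to the list; the 
enum and both lookups follow automatically.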

The Access API goes through a template pipeline. First the Access class 
bridges the API to functions in the AccessInternal namespace. This 
involves using temporary proxy objects to artificially infer the return 
types of loads. Then in the AccessInternal namespace the types are first 
decayed, meaning that CV-qualifiers and references are stripped. Then 
types of addresses and values are joined, at which time certain 
decorators are inferred, like the use of compressed oops when e.g. 
loading an oop from a narrowOop*. Other implicit decorators are also 
inferred then, such as a default memory ordering if none is specified, 
and other rules related to memory ordering, such as sequentially 
consistent stores implicitly also being releasing stores. Then 
build-time decorators are added and a pre-runtime stage is reached where 
the mechanism tries to bind accesses statically if possible, and 
otherwise produces a runtime-dispatch point that statically generates 
all possible runtime variants of the access and a self-patching function 
pointer that resolves the correct variant at runtime. These statically 
generated 
barriers are resolved through the BarrierSet AccessBarrier that gives 
the GC full control for generating an appropriate access. It can use the 
DecoratorTest class to check for different decorators specifying 
semantics that add barriers altering the access. Eventually, the access 
reaches a super class of the AccessBarrier called BasicAccessBarrier, 
which handles compressed oops and calls RawAccessBarrier. That class 
inspects the decayed types and forwards to the appropriate calls to 
Atomic or OrderAccess, or performs volatile or raw accesses, depending 
on the selected memory ordering.
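
The decay and inference stages of this pipeline can be sketched at 
compile time in C++11 as follows. This is an illustration only: the 
decorator bits, the choice of relaxed as the default ordering, and the 
helper names (DecayedPointee, infers_compressed_oops, expand_decorators) 
are assumptions, and the seq_cst rule is applied to every access here 
rather than to stores only.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

typedef uint64_t DecoratorSet;
// Hypothetical decorator bits for this sketch only.
constexpr DecoratorSet MO_RELAXED      = 1u << 0;
constexpr DecoratorSet MO_RELEASE      = 1u << 1;
constexpr DecoratorSet MO_SEQ_CST      = 1u << 2;
constexpr DecoratorSet MO_MASK         = MO_RELAXED | MO_RELEASE | MO_SEQ_CST;
constexpr DecoratorSet COMPRESSED_OOPS = 1u << 3;

struct oopDesc;
typedef oopDesc* oop;        // stand-in for an uncompressed reference
typedef uint32_t narrowOop;  // stand-in for a compressed reference

// Decay step: strip references and CV-qualifiers, also on the pointee.
template <typename AddrT>
struct DecayedPointee {
  typedef typename std::remove_cv<typename std::remove_pointer<
      typename std::decay<AddrT>::type>::type>::type type;
};

// Join step: an oop value accessed through a narrowOop* implies compression.
template <typename AddrT, typename ValueT>
constexpr bool infers_compressed_oops() {
  return std::is_same<typename DecayedPointee<AddrT>::type, narrowOop>::value &&
         std::is_same<typename std::decay<ValueT>::type, oop>::value;
}

// Implicit ordering rules: pick a default, and let seq_cst imply release.
constexpr DecoratorSet with_default_ordering(DecoratorSet d) {
  return (d & MO_MASK) == 0 ? (d | MO_RELAXED) : d;
}
constexpr DecoratorSet with_implied_ordering(DecoratorSet d) {
  return (d & MO_SEQ_CST) != 0 ? (d | MO_RELEASE) : d;
}

// Full expansion of the decorators given at the call site.
template <DecoratorSet decorators, typename AddrT, typename ValueT>
constexpr DecoratorSet expand_decorators() {
  return with_implied_ordering(with_default_ordering(decorators)) |
         (infers_compressed_oops<AddrT, ValueT>() ? COMPRESSED_OOPS
                                                  : DecoratorSet(0));
}

static_assert(expand_decorators<0, narrowOop*, oop>() ==
                  (MO_RELAXED | COMPRESSED_OOPS),
              "default ordering plus inferred compression");
```

Everything here is resolved by the compiler, so the inferred decorators 
cost nothing at the call site.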

I have then applied the Access API to many weird accesses performed in 
the runtime system where we check if we are using G1 and then 
subsequently doing some weird ad-hoc SATB enqueue barrier. Examples 
include the string table, ciMetadata and jvmtiTagMap, unsafe get, 
reference get, jweak resolve etc. These accesses now use decorated 
accesses through Access instead.

== C1 ==

The shared C1 barrier code has been moved into the C1BarrierSetCodeGen 
class for each specific barrier set. It generates decorated accesses 
and adds GC barriers to them as required by the specified semantics. 
The slowpath stubs have been refactored. The code stubs have moved into 
the C1BarrierSetCodeGen class, which assembles the machine code with 
the platform-specific BarrierSetCodeGen assembler. The runtime1 code 
stubs are no longer generated in switch statements, but instead with a 
code generation closure that calls into the C1BarrierSetCodeGen, which 
assembles the runtime1 stub with the platform-specific 
BarrierSetCodeGen.

The design of accesses going through C1BarrierSetCodeGen is consistent 
with the rest of the Access API: the accesses are built top down and 
allow overriding the whole operation. The C1BarrierSetCodeGen class 
mirrors the class hierarchy of the BarrierSets.

== C2 ==

Similar to the C1BarrierSetCodeGen, the C2BarrierSetCodeGen helps the 
GraphKit generate decorated accesses top-down. The class hierarchy of 
the C2BarrierSetCodeGen class mirrors the class hierarchy of the 
BarrierSets. Since C2 expands GC barriers rather early and then pulls 
the barriers through optimizations, there are some additional calls to 
be able to distinguish barrier-related nodes from non-barrier nodes.

== Graal ==

For now I only try not to break the Graal port used for AoT in the 
hotspot repository. Ideally, Graal would follow the same pattern, but 
initially this is out of scope for me.

== GC: BarrierSet consolidation ==

The hierarchy of our barrier sets seems unnecessarily deep - partially 
because the card table itself is part of the card table barrier sets. I 
have split the card table hierarchy and separated it from the barrier 
set hierarchy. A CardTableModRefBarrier *has* a CardTable. As a result 
the hierarchy could be simplified a lot to contain only BarrierSet, 
ModRefBarrierSet, CardTableBarrierSet and G1BarrierSet. G1BarrierSet and 
CardTableBarrierSet are the only leaves, and ModRefBarrierSet is only a 
small helper class.
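
The has-a relationship could be sketched like this in C++. The classes 
carry a Sketch suffix because they are simplified stand-ins: the 
512-byte card size, the byte-vector card storage and the 
write_ref_field_post signature are assumptions for the example, not the 
webrev's definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The card table is its own class, outside the barrier set hierarchy.
class CardTableSketch {
 public:
  enum CardValue { clean = 0, dirty = 1 };

  CardTableSketch(uintptr_t heap_start, size_t heap_bytes)
      : _heap_start(heap_start), _cards((heap_bytes >> card_shift) + 1, clean) {}

  void mark_dirty(uintptr_t addr) { _cards[index_for(addr)] = dirty; }
  bool is_dirty(uintptr_t addr) const { return _cards[index_for(addr)] == dirty; }

 private:
  static const int card_shift = 9;  // 512-byte cards, an assumption here

  size_t index_for(uintptr_t addr) const {
    assert(addr >= _heap_start);
    return (addr - _heap_start) >> card_shift;
  }

  uintptr_t _heap_start;
  std::vector<uint8_t> _cards;
};

class BarrierSetSketch {
 public:
  virtual ~BarrierSetSketch() {}
};

// Small helper layer for barrier sets that track reference modifications.
class ModRefBarrierSetSketch : public BarrierSetSketch {
 public:
  virtual void write_ref_field_post(uintptr_t field_addr) = 0;
};

// Leaf barrier set: it *has* a card table rather than *being* one.
class CardTableBarrierSetSketch : public ModRefBarrierSetSketch {
 public:
  explicit CardTableBarrierSetSketch(CardTableSketch* card_table)
      : _card_table(card_table) {}

  void write_ref_field_post(uintptr_t field_addr) override {
    _card_table->mark_dirty(field_addr);
  }

 private:
  CardTableSketch* _card_table;
};
```

With composition, the card table can change shape without adding another 
level to the barrier set hierarchy.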

== Collaboration ==

I hope you like the direction this is going and hope it will suit 
Shenandoah as well. I have not yet applied the Access API for all 
primitives because I thought that you probably have a better idea 
where they are since your GC uses such barriers a lot more. But the 
framework should be able to support that without much trouble. So I hope 
we can work together a bit on this. If there are any shortcomings, I 
hope we can work it out together.

Also, as you can see, I have only provided x86 and SPARC ports so far. 
The architecture specific code mostly involves the stub generators, the 
interpreter, and the G1 C1 slow path stuff. I was hoping to eventually 
get some help from other port maintainers to port this to their 
respective platforms. If you feel compelled to help porting this to ARM, 
I would be very happy. ;)

And perhaps somebody would like to help getting PPC and S390 on board 
too. I thought I would at least start the discussion now.

So yeah, hope everyone likes this direction. If there are any questions, 
I will happily answer them. Any feedback is very welcome.

Thanks,
/Erik

On 2017-04-25 14:05, Per Liden wrote:
> On 2017-04-24 15:46, Roman Kennke wrote:
>> Am 24.04.2017 um 08:37 schrieb Per Liden:
>>> On 04/20/2017 02:29 PM, Roman Kennke wrote:
>>>> Am 20.04.2017 um 14:01 schrieb Per Liden:
>>>>> On 2017-04-20 12:05, Aleksey Shipilev wrote:
>>>>>> On 04/20/2017 09:37 AM, Kirk Pepperdine wrote:
>>>>>>>> Good stuff. However, one thing I'm not quite comfortable with 
>>>>>>>> is the
>>>>>>>> introduction of the GC class (and its sub classes). I don't quite
>>>>>>>> see the
>>>>>>>> purpose of this interface split-up between GC and CollectedHeap. I
>>>>>>>> view
>>>>>>>> CollectedHeap as _the_ interface (but yes, it needs some love), 
>>>>>>>> and
>>>>>>>> as a
>>>>>>>> result I think the functions you've exposed in the GC class
>>>>>>>> actually
>>>>>>>> belong in CollectedHeap.
>>>>>>>
>>>>>>> I thought the name CollectedHeap implied the state of the heap
>>>>>>> after the
>>>>>>> collector has completed. What is the intent of CollectedHeap?
>>>>>>
>>>>>> No, CollectedHeap is the actual current GC interface. This is the
>>>>>> entry point to
>>>>>> GC as far as the rest of runtime is concerned, see e.g. 
>>>>>> CollectedHeap*
>>>>>> Universe::create_heap(), etc. Implementing CollectedHeap,
>>>>>> CollectorPolicy, and
>>>>>> BarrierSet are the bare minimum required for GC implementation
>>>>>> today. [1]
>>>>>
>>>>> Yep, and I'd like us to move towards tightening down the GC 
>>>>> interface to
>>>>> basically be cleaned up versions of CollectedHeap and BarrierSet.
>>>>>
>>>>> CollectorPolicy and some other things that class drags along, like
>>>>> AdaptiveSizePolicy, are way too collector specific and I don't think
>>>>> that should be exposed to the rest of the VM.
>>>>
>>>> Right, I totally agree with this.
>>>>
>>>> BTW, another reason for making a new GC interface class instead of
>>>> further bloating CollectedHeap as the central interface was that there
>>>> is way too much implementation stuff in CollectedHeap. Ideally, I'd 
>>>> like
>>>> to have a true interface with no or only trivial implementations 
>>>> for the
>>>> declared methods, and most importantly nothing that's only ever needed
>>>> by the GC itself (and never called by the runtime). But as I said, I'm
>>>> not against a serious refactoring and tightening-up of CollectedHeap
>>>> instead.
>>>
>>> Yes, I'd like to keep CollectedHeap as the main interface, but I
>>> completely agree that CollectedHeap currently contains too much
>>> implementation stuff that we probably want to move out.
>>
>> Ok, I will revert that part of the change to use CollectedHeap as main
>> interface then. It's no big deal, so far I only had one additional
>> method for serviceability support in the GC interface class anyway.
>
> Ok, sounds good.
>
> And regarding BarrierSet. As you know, Erik Österlund is working on 
> overhauling BarrierSet and how barriers are used across the VM. He'll 
> be sending out his current proposal later today.
>
>>
>> Would you also prefer keep 'management' of the heap in Universe too?
>> I.e. Universe::create_heap() and Universe::heap() etc? Or do you see a
>> benefit in moving it out like I did with gc_factory.cpp? The idea being
>> that there's only one smallish place that knows about all the existing
>> GC impls?
>
> I'd like to keep Universe::heap() and create_heap(), but I'd like to 
> move away from our current if-else if-else.. and instead have a more 
> declarative way of saying which GC's are available. create_heap() 
> would then just walk the list of available GC and ask if it's enabled 
> and if so create an instance. I think we'd want to do something 
> similar to (or even combine this with) what Erik is doing in his 
> BarrierSet patch.
>
> In general, to make it easier to review/test/integrate all these 
> changes it would be good if we can have incremental patches, each 
> addressing some specific/contained area.
>
> cheers,
> Per