GC interface update

Wed Apr 26 10:06:38 UTC 2017

Hi Roman,

On 2017-04-25 18:30, Roman Kennke wrote:
> Hi Erik,
>
> wow that is a lot of stuff coming up :-) It mostly matches what I had in
> mind, and even goes much further (still need to digest the gory details
> ;-) ). Some quick thoughts in random order:

I'm glad you like it.

> I like that you planned for generators for interpreter, c1, c2 and
> graal. I was about to start working on that, but if you're already doing
> it, I don't need to.

:)

> Regarding Shenandoah, what we need there is a way to apply read- and
> write-barriers on *every* heap access. This is used to resolve the
> target object to its to-space copy. For example, when reading a field
> from an object, and that object is in from-space, we first need to read
> its forwarding pointer to arrive at the to-space copy and read from
> there. Similarly, for writes, we first need to invoke some write barrier
> magic to copy the target object to to-space, CAS the forwarding pointer
> of the from-space object, and then do the write into the to-space copy.
> In pseudocode, a store (or load) that used to look like this:
>
> store(oop obj, int offset, int value)
>
> now needs to look like this:
>
> obj = write_barrier(obj)
> store(obj, offset, value)
>
> With regards to the GC interface, this means we need to have access to
> the source (for loads) or target (for stores) object, not only the
> actual field address. In fact, a field address would be pointless; what
> we need is the object+offset.
>
> Does your proposal provide for this?

Yes it does. The HeapAccess class (Access on the Java heap) has 
store_at, load_at etc. that take a base pointer and an offset - just 
what you need.
This was in the back of my head throughout the design of the interface 
as I knew you would need this. There might be a few uncommon 
exceptions that I have not handled yet, like 
oopDesc::atomic_compare_exchange_oop, which takes an address rather than 
a base object plus offset. For now it just forwards the call to the 
HeapAccess<MO_SEQ_CST>::oop_cas API without a base pointer, but you 
might want to change these call sites to use 
HeapAccess<MO_SEQ_CST>::oop_cas_at instead, which supplies a base pointer.
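
To make the base-plus-offset point concrete, here is a minimal sketch. It 
is not the code from the webrev: Obj, resolve, load_at and store_at below 
are hypothetical free functions, and resolve only follows a forwarding 
pointer (a real Shenandoah write barrier would also copy the object).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a heap object with a forwarding pointer.
struct Obj {
  Obj* forwardee;          // points to itself, or to the to-space copy
  std::int32_t fields[4];  // payload
};

// Hypothetical barrier: resolve an object to its current copy.
inline Obj* resolve(Obj* base) {
  assert(base != nullptr);
  return base->forwardee;
}

// Because the accessors receive base + offset (not a raw field address),
// the barrier can redirect the access to the resolved copy.
inline std::int32_t load_at(Obj* base, std::ptrdiff_t offset) {
  Obj* resolved = resolve(base);
  std::int32_t value;
  std::memcpy(&value, reinterpret_cast<char*>(resolved) + offset, sizeof value);
  return value;
}

inline void store_at(Obj* base, std::ptrdiff_t offset, std::int32_t value) {
  Obj* resolved = resolve(base);
  std::memcpy(reinterpret_cast<char*>(resolved) + offset, &value, sizeof value);
}
```

With only a field address, neither function could find the forwarding 
pointer; with the base object it is a single extra load.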

> Why don't you push all this into the jdk10-sandbox, under the
> JDK-8163329-branch (aka GC-interface-branch) ? We do need to collaborate
> on this stuff, and the best way to do that would be with actual code
> exchange. It's easy to do in the sandbox: we can go completely wild in
> there until we're satisfied ;-)

I agree we need to collaborate here. Having said that - I hope your 
version of "completely wild" is not the same as mine. ;)
I will push the code to the sandbox.

> To be honest, I wouldn't go over the top to optimize runtime barrier
> accesses. I haven't seen a single benchmark yet that suffers from
> virtual calls. Shenandoah does introduce *much* more virtual calls (in
> its current design), i.e. one for each primitive load and store, and it
> doesn't seem to impact performance or show up in profiles on benchmarks
> we are running *at all*. It seems like a complete non-issue to me. I
> suppose it is possible to construct benchmarks where it does matter
> (heavily exercising JNI heap accessors comes to mind), but even then I
> don't think virtual calls in the runtime accessors hurt that much. If
> you have such benchmark, please please let me (or us) know.

Sorry if this was not clear enough, but the main purpose of the template 
machinery was not to micro-optimize virtual calls. It is more of a nice 
bonus you get. The main purpose is to unify all these different weird 
accesses that need special treatment, because potentially orthogonal 
semantics require different GCs to treat them differently. This moves 
the complexity from special GC treatment code sprinkled all over hotspot 
into a contained (and limited) mediator between the users of the Access 
API and the backends. The API itself remains easy to use, both for users 
of the Access API and for backends.

Having said that, I have seen statistically significant single digit 
percent performance improvements in some smaller benchmarks that did not 
get to run long enough to reach peak performance. So while the 
performance of this was not a main goal, I could at least see a 
difference. I might want to revisit that... My performance goal was 
simply to not make things worse.

> My idea for runtime accessors basically boiled down to the API that's
> currently in oop.hpp / oop.inline.hpp: i.e. forward all heap access
> through the barrier via 1 (and only 1) virtual call. I don't exactly
> mind some magic to avoid even this one virtual call, but I question if
> it's worth the additional complexity (which doesn't sound exactly
> negligible).

The simpler API you refer to in oop.hpp does not yet acknowledge all the 
weird accesses we do - it handles the default heap accesses on only 
strongly reachable objects and then sprinkles conditionally executed 
GC-specific barriers around these default accesses at callsites where 
there are such weird accesses, rather than supplying the semantics. I 
believe I saw this was on your TODO-list. This system is the result of 
going down that rabbit hole.

In the case of for example unsafe getObject that might require a SATB 
barrier, your interface first performs one of those normal accesses from 
oop.hpp and then conditionally checks which GC was selected and then 
conditionally inserts various GC-specific SATB barriers. One could of 
course add different virtual call accesses on oop - one for each 
permutation of semantic properties being used. But then there are 
different properties tracked by different GCs, and adding more might get 
awkward. For example whether oops can be null and whether the destination 
address has been initialized or not.
And then there are all the weird root accesses with special treatment. 
Like oop stores on nmethods or klass mirrors. Or loads on JVMTI tag map 
entries, jweaks, the string table, class holders, etc. And then there 
are unsafe accesses that may or may not require special GC treatment and 
might do different stuff depending on what size atomic instructions are 
available, what GC was selected, whether it was a reference object, 
whether compressed oops are being used, whether the machine is 
non-multiple copy atomic or not... etc. This kind of sprinkled special 
handling complexity all over the place is what I want to get rid of.

My Access interface instead captures all the semantic properties of the 
access and the template machinery automatically instantiates the 
required and possible barriers for the supported GCs available in the 
build and runtime properties like compressed oops to conform to the 
semantics, and selects the right one at runtime with the function 
pointer patching. As a bonus, you get that more optimized access that 
may or may not make your application happier.
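
As a rough illustration of decorators carried as template parameters, 
here is a minimal sketch. The decorator names echo the ones in my 
proposal, but the bit values, HasDecorator, WeakLoadBarrier, ToyAccess and 
the toy SATB queue are all made up for this example and are not the 
classes in the webrev.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical decorator bits; the values are arbitrary.
using DecoratorSet = std::uint32_t;
constexpr DecoratorSet ACCESS_ON_HEAP = 1u << 0;
constexpr DecoratorSet ACCESS_ON_WEAK = 1u << 1;

// Compile-time test for a decorator in a decorator set.
template <DecoratorSet decorators, DecoratorSet wanted>
struct HasDecorator {
  static constexpr bool value = (decorators & wanted) != 0;
};

// Toy stand-in for a SATB queue: records references that were enqueued.
inline std::vector<const void*>& satb_queue() {
  static std::vector<const void*> queue;
  return queue;
}

// Selected at compile time: no barrier by default...
template <bool enqueue>
struct WeakLoadBarrier {
  static void apply(const void*) {}
};

// ...and an enqueue barrier when the access is on a weak reference.
template <>
struct WeakLoadBarrier<true> {
  static void apply(const void* ref) {
    if (ref != nullptr) satb_queue().push_back(ref);
  }
};

// The call site states semantics; the barrier choice follows from them.
template <DecoratorSet decorators>
struct ToyAccess {
  static const void* oop_load(const void* const* addr) {
    assert(addr != nullptr);
    const void* ref = *addr;
    WeakLoadBarrier<HasDecorator<decorators, ACCESS_ON_WEAK>::value>::apply(ref);
    return ref;
  }
};
```

A caller would write ToyAccess<ACCESS_ON_HEAP | ACCESS_ON_WEAK>::oop_load(addr) 
and never mention SATB or a particular GC.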

> Aleksey is
> building a GC benchmark suite, and I'm sure he'd be happy to include
> such a benchmark [1].
>
> [1] http://icedtea.classpath.org/people/shade/gc-bench/

Nice.

> We can help with the arm64 port. :-)

Thank you, very glad to hear that! :)

> Those comments are purely based on your description. Now I'm going to
> study your patch :-)

May I recommend a cup of coffee...

> Again, I think it would be great to simply push this to the sandbox.

Okay, will do.

> Also, I'm hanging out on IRC #openjdk on OFTC as rkennke.

I'm eosterlu in there.

Thanks,
/Erik

> Cheers,
> Roman
>
>
>
>
> Am 25.04.2017 um 17:35 schrieb Erik Österlund:
>> Hi everyone,
>>
>> I'm glad to see that we all want to go towards modularizing the GC
>> implementations in hotspot more. Thank you Roman for starting this
>> thread. I have wanted a better GC interface since I first set foot in
>> hotspot.
>>
>> As mentioned, I have been cooking up a GC barrier interface prototype
>> based on ideas mentioned earlier in this thread. I will provide a
>> preview of where it is headed in this email before we start to diverge
>> too much.
>>
>> I have a long patch queue with many individual changes, but to get an
>> overview for the discussion here, I will start by posting a combined
>> webrev for preview of the whole thing as a pre-review. Once the real
>> detailed review starts, I will start sending out the smaller incremental
>> changes that are easier to grasp in reviews.
>>
>> The webrev is based on the latest jdk10-hs repo.
>>
>> Full webrev: http://cr.openjdk.java.net/~eosterlund/gc_interface/webrev.00/
>>
>> == High Level Design Philosophy ==
>>
>> The overall design idea has been to remove all explicit calls to barrier
>> code in the VM. The barrier code is often required for conforming to
>> some kind of semantics. So rather than having these explicit barrier
>> calls in the shared code, the new API is instead to perform a memory
>> access that specifies the intended semantic properties instead. These
>> semantic properties are not only GC related but could be any property
>> that is important for performing a memory access with the right
>> semantics. These semantic properties are called decorators in my API.
>>
>> So for example, a decorator for a load on an oop could be ACCESS_ON_HEAP
>> to denote that this access is performed in the Java heap, and MO_ACQUIRE
>> that the access should have acquire memory ordering semantics and
>> ACCESS_ON_WEAK to denote that the access is performed on a weakly
>> reachable reference. This would in the end need to boil down to the
>> following barriers:
>> 1) Compressed oops to decode narrow oops
>> 2) Potentially an acquire membar on e.g. ARM machines
>> 3) Potentially a SATB enqueue barrier for SATB-type GCs
>>
>> So rather than treating these differently and having runtime-resolved
>> explicit barriers built bottom up, the approach is to build accesses top
>> down instead. The GC may override the whole access to do anything, but
>> will probably want to reuse things like decoding and encoding compressed
>> oops, and the pre/post write pattern and memory ordering. Therefore
>> barrier sets may ask their super class to fill in such details. This
>> allows arbitrary level of control without introducing code duplication.
>>
>> Each BarrierSet has 4 barrier-related components with a similar design.
>>
>> 1) An AccessBarrier class responsible for performing accesses requested
>> by the runtime system through a new Access API (more later)
>> 2) A BarrierSetCodeGen class responsible for generating accesses in
>> platform-specific assembly code (stub generators and interpreter)
>> 3) A C1BarrierSetCodeGen class responsible for generating accesses for
>> the C1 compiler
>> 4) A C2BarrierSetCodeGen class responsible for generating accesses for
>> the C2 compiler
>>
>> So there is one class for each part of hotspot (runtime,
>> platform-specific, c1, c2), and they all follow a class hierarchy
>> mirroring their respective BarrierSet hierarchy to reuse more general
>> functionality like memory ordering and compressed oops.
>>
>> == Runtime: Access API ==
>>
>> The runtime part of the API goes through a new class called Access. All
>> decorated accesses should go through this interface. It makes heavy use
>> of templates to perform the right accesses and barriers in the most
>> optimal way, by connecting the intended Access semantics to the
>> appropriately decorated AccessBarrier of the current BarrierSet. It
>> combines different decorators in a pipeline that are resolved at
>> different times in the JVM life cycle, but handled in the same way. Some
>> decorators are resolved at build-time, like for example whether the
>> build needs to support barriers on primitives. If Shenandoah is built
>> for example, this decorator will be set, and if Shenandoah is not built,
>> it will not be set. Other decorators are resolved statically at the call
>> site, such as what strength a reference has. Yet some decorators are
>> resolved at runtime, such as whether compressed oops are used or not and
>> which garbage collector was selected.
>>
>> When there exist runtime dependencies for resolving a barrier, the
>> Access system will generate function pointers for the access. The
>> function pointers initially point to a resolver function that checks the
>> selected runtime properties, and then patches the function pointer to
>> point to a statically generated function with those properties set, so
>> that the next time the function is called, it will call straight into
>> the appropriate barrier. This means that where we would previously have
>> multiple virtual calls for pre- and post-write barriers, followed by if
>> checks for compressed oops, all of that boils down to a single function
>> pointer call that then has inlined everything that needs to be done for
>> that set of runtime parameters.
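
The self-patching function pointer described above can be sketched as 
follows. Everything here is hypothetical (the flag, the function names, 
and the shift that stands in for narrow oop decoding), and the sketch 
ignores the concurrency concerns a real VM would have when patching.

```cpp
#include <cassert>

// Hypothetical runtime property, known only after startup.
static bool use_compressed_oops = false;
static int resolver_calls = 0;

typedef int (*LoadFn)(const int*);

// Statically generated variants, one per runtime configuration.
static int load_plain(const int* addr) { return *addr; }
static int load_decoded(const int* addr) { return *addr << 3; }  // stand-in for decoding

static int load_resolver(const int* addr);

// The function pointer initially points at the resolver.
static LoadFn load_fn = &load_resolver;

// First call: check runtime properties, patch the pointer, then forward.
static int load_resolver(const int* addr) {
  ++resolver_calls;
  load_fn = use_compressed_oops ? &load_decoded : &load_plain;
  return load_fn(addr);
}

// Every access is a single indirect call through the patched pointer.
inline int load(const int* addr) {
  assert(addr != nullptr);
  return load_fn(addr);
}
```

After the first call, load goes straight to the selected variant and the 
resolver is never entered again.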
>>
>> The goal has been to separate out GC-specific code to GC-specific
>> directories as far as barriers are concerned. To glue this together,
>> there is a barrierSetConfig.hpp and barrierSetConfig.inline.hpp. The
>> barrierSetConfig.hpp configures what barrier sets there are and produces
>> a macro allowing you to do something for each barrier set. This is used
>> by barrier resolution at runtime. The barrierSetConfig.inline.hpp
>> basically just includes the GC-specific inline headers to allow
>> inlining the GC barriers all the way. So anyone making a new GC should
>> put their GC in there. I added a Shenandoah GC there as an example so
>> you can see what I mean.
>>
>> The Access API goes through a template pipeline. First the Access class
>> bridges the API to functions in the AccessInternal namespace. This
>> involves using temporary proxy objects to artificially infer the return
>> types of loads. Then in the AccessInternal namespace the types are first
>> decayed, meaning that CV-qualifiers and references are stripped. Then
>> types of addresses and values are joined, at which time certain
>> decorators are inferred like the use of compressed oops when e.g. loading
>> an oop from a narrowOop*. Other implicit decorators are also inferred
>> then, such as a default memory ordering if none is specified, and other
>> rules related to memory ordering such as sequentially consistent stores
>> implicitly also being releasing stores etc. Then buildtime decorators
>> are added and a pre-runtime stage is reached where the mechanism tries
>> to bind accesses statically if possible, and otherwise producing a
>> runtime-dispatch point that statically generates all possible runtime
>> variants of the access and a self-patching function pointer that
>> resolves the correct variant at runtime. These statically generated
>> barriers are resolved through the BarrierSet AccessBarrier that gives
>> the GC full control for generating an appropriate access. It can use the
>> DecoratorTest class to check for different decorators specifying
>> semantics that add barriers altering the access. Eventually, a super
>> class of the AccessBarrier called BasicAccessBarrier handles
>> compressed oops and calls RawAccessBarrier, which inspects the decayed
>> types and forwards to appropriate calls to Atomic or OrderAccess, or
>> performs volatile or raw accesses depending on selected memory ordering.
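
The type decay and the inference of compressed oops from the address 
type can be sketched with standard type traits. The types and trait names 
below (ToyObject, oop, narrowOop, DecayedPointee, InfersCompressedOops) 
are hypothetical stand-ins for illustration, not the AccessInternal 
machinery itself.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for HotSpot's oop and narrowOop types.
struct ToyObject {};
using oop = ToyObject*;
enum class narrowOop : std::uint32_t {};

// Step 1: decay the address type (drops references and top-level cv),
// then strip cv-qualifiers from the pointee as well.
template <typename AddrT>
struct DecayedPointee {
  using Pointer = typename std::decay<AddrT>::type;
  using type = typename std::remove_cv<
      typename std::remove_pointer<Pointer>::type>::type;
};

// Step 2: join address and value types. Accessing an oop value through a
// narrowOop* implies a (hypothetical) compressed-oops decorator.
template <typename AddrT, typename ValueT>
struct InfersCompressedOops {
  static constexpr bool value =
      std::is_same<typename DecayedPointee<AddrT>::type, narrowOop>::value &&
      std::is_same<typename std::decay<ValueT>::type, oop>::value;
};

static_assert(InfersCompressedOops<volatile narrowOop* const&, oop>::value,
              "narrowOop* address with oop value needs decoding");
static_assert(!InfersCompressedOops<oop*, const oop&>::value,
              "plain oop* address needs no decoding");
```

The point is that the caller never states "compressed oops"; the 
property falls out of the types once qualifiers are stripped.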
>>
>> I have then applied the Access API to many weird accesses performed in
>> the runtime system where we check if we are using G1 and then
>> subsequently doing some weird ad-hoc SATB enqueue barrier. Examples
>> include the string table, ciMetadata and jvmtiTagMap, unsafe get,
>> reference get, jweak resolve etc. These accesses now use decorated
>> accesses through Access instead.
>>
>> == C1 ==
>>
>> The shared C1 barrier code has been moved into the C1BarrierSetCodeGen
>> class for each specific barrier set. It generates decorated accesses,
>> and decorates them with GC barriers as required by the specified
>> semantics. The slowpath stubs have been refactored. The code stubs have
>> moved into the C1BarrierSetCodeGen class and it assembles the machine
>> code with the platform specific BarrierSetCodeGen assembler. The
>> runtime1 code stubs have been changed to not be generated in switch
>> statements, but instead with a code generation closure that calls into
>> the C1BarrierSetCodeGen that assembles the runtime1 stub with the
>> platform specific BarrierSetCodeGen.
>>
>> The design of accesses going through C1BarrierSetCodeGen is consistent
>> with the rest of the Access API: the accesses are built top down and
>> allow overriding the whole operation. The C1BarrierSetCodeGen class
>> mirrors the class hierarchy of the BarrierSets.
>>
>> == C2 ==
>>
>> Similar to the C1BarrierSetCodeGen, the C2BarrierSetCodeGen helps the
>> GraphKit generate decorated accesses top-down. The class hierarchy of
>> the C2BarrierSetCodeGen class mirrors the class hierarchy of the
>> BarrierSets. Since C2 expands GC barriers rather early and then pulls
>> the barriers through optimizations, there are some additional calls to
>> be able to distinguish barrier-related nodes from non-barrier nodes.
>>
>> == Graal ==
>>
>> For now I only try not to break the Graal port used for AoT in the
>> hotspot repository. Ideally, graal would follow the same pattern, but
>> initially this is out of scope for me.
>>
>> == GC: BarrierSet consolidation ==
>>
>> The hierarchy of our barrier sets seems unnecessarily deep - partially
>> because the card table itself is part of the card table barrier sets. I
>> have split the card table hierarchy and separated it from the barrier
>> set hierarchy. A CardTableModRefBarrier *has* a CardTable. As a result
>> the hierarchy could be simplified a lot to contain only BarrierSet,
>> ModRefBarrierSet, CardTableBarrierSet and G1BarrierSet. G1BarrierSet and
>> CardTableBarrierSet are the only leaves, and ModRefBarrierSet is only a
>> small helper class.
>>
>> == Collaboration ==
>>
>> I hope you like the direction this is going and hope it will suit
>> Shenandoah as well. I have not yet applied the Access API to all
>> primitives because I thought that you probably have a better idea
>> where they are since your GC uses such barriers a lot more. But the
>> framework should be able to support that without much trouble. So I hope
>> we can work together a bit on this. If there are any shortcomings, I
>> hope we can work it out together.
>>
>> Also, as you can see, I have only provided x86 and SPARC ports so far.
>> The architecture specific code mostly involves the stub generators, the
>> interpreter, and the G1 C1 slow path stuff. I was hoping to eventually
>> get some help from other port maintainers to port this to their
>> respective platforms. If you feel compelled to help porting this to ARM,
>> I would be very happy. ;)
>>
>> And perhaps somebody would like to help getting PPC and S390 on board
>> too. I thought I would at least start the discussion now.
>>
>> So yeah, hope everyone likes this direction. If there are any questions,
>> I will happily answer them. Any feedback is very welcome.
>>
>> Thanks,
>> /Erik
>>
>> On 2017-04-25 14:05, Per Liden wrote:
>>> On 2017-04-24 15:46, Roman Kennke wrote:
>>>> Am 24.04.2017 um 08:37 schrieb Per Liden:
>>>>> On 04/20/2017 02:29 PM, Roman Kennke wrote:
>>>>>> Am 20.04.2017 um 14:01 schrieb Per Liden:
>>>>>>> On 2017-04-20 12:05, Aleksey Shipilev wrote:
>>>>>>>> On 04/20/2017 09:37 AM, Kirk Pepperdine wrote:
>>>>>>>>>> Good stuff. However, one thing I'm not quite comfortable with
>>>>>>>>>> is the
>>>>>>>>>> introduction of the GC class (and its sub classes). I don't quite
>>>>>>>>>> see the
>>>>>>>>>> purpose of this interface split-up between GC and CollectedHeap. I
>>>>>>>>>> view
>>>>>>>>>> CollectedHeap as _the_ interface (but yes, it needs some love),
>>>>>>>>>> and
>>>>>>>>>> as a
>>>>>>>>>> result I think the functions you've exposed in the GC class
>>>>>>>>>> actually
>>>>>>>>>> belong in CollectedHeap.
>>>>>>>>> I thought the name CollectedHeap implied the state of the heap
>>>>>>>>> after the
>>>>>>>>> collector has completed. What is the intent of CollectedHeap?
>>>>>>>> No, CollectedHeap is the actual current GC interface. This is the
>>>>>>>> entry point to
>>>>>>>> GC as far as the rest of runtime is concerned, see e.g.
>>>>>>>> CollectedHeap*
>>>>>>>> Universe::create_heap(), etc. Implementing CollectedHeap,
>>>>>>>> CollectorPolicy, and
>>>>>>>> BarrierSet are the bare minimum required for GC implementation
>>>>>>>> today. [1]
>>>>>>> Yep, and I'd like us to move towards tightening down the GC
>>>>>>> interface to
>>>>>>> basically be cleaned up versions of CollectedHeap and BarrierSet.
>>>>>>>
>>>>>>> CollectorPolicy and some other things that class drags along, like
>>>>>>> AdaptiveSizePolicy, are way too collector specific and I don't think
>>>>>>> that should be exposed to the rest of the VM.
>>>>>> Right, I totally agree with this.
>>>>>>
>>>>>> BTW, another reason for making a new GC interface class instead of
>>>>>> further bloating CollectedHeap as the central interface was that there
>>>>>> is way too much implementation stuff in CollectedHeap. Ideally, I'd
>>>>>> like
>>>>>> to have a true interface with no or only trivial implementations
>>>>>> for the
>>>>>> declared methods, and most importantly nothing that's only ever needed
>>>>>> by the GC itself (and never called by the runtime). But as I said, I'm
>>>>>> not against a serious refactoring and tightening-up of CollectedHeap
>>>>>> instead.
>>>>> Yes, I'd like to keep CollectedHeap as the main interface, but I
>>>>> completely agree that CollectedHeap currently contains too much
>>>>> implementation stuff that we probably want to move out.
>>>> Ok, I will revert that part of the change to use CollectedHeap as main
>>>> interface then. It's no big deal, so far I only had one additional
>>>> method for serviceability support in the GC interface class anyway.
>>> Ok, sounds good.
>>>
>>> And regarding BarrierSet. As you know, Erik Österlund is working on
>>> overhauling BarrierSet and how barriers are used across the VM. He'll
>>> be sending out his current proposal later today.
>>>
>>>> Would you also prefer to keep 'management' of the heap in Universe?
>>>> I.e. Universe::create_heap() and Universe::heap() etc? Or do you see a
>>>> benefit in moving it out like I did with gc_factory.cpp? The idea being
>>>> that there's only one smallish place that knows about all the existing
>>>> GC impls?
>>> I'd like to keep Universe::heap() and create_heap(), but I'd like to
>>> move away from our current if-else if-else.. and instead have a more
>>> declarative way of saying which GCs are available. create_heap()
>>> would then just walk the list of available GCs, ask each if it's enabled
>>> and if so create an instance. I think we'd want to do something
>>> similar to (or even combine this with) what Erik is doing in his
>>> BarrierSet patch.
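
The declarative list of GCs suggested above might look roughly like this. 
It is only a sketch under assumptions: ToyHeap, GCEntry, the two flags and 
this create_heap are hypothetical and much simpler than 
Universe::create_heap().

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical minimal heap type.
struct ToyHeap {
  std::string name;
};

// One table entry per GC: how to ask if it is enabled, how to create it.
struct GCEntry {
  const char* name;
  bool (*is_enabled)();
  std::unique_ptr<ToyHeap> (*create)();
};

// Hypothetical flags standing in for command-line GC selection.
static bool use_serial = false;
static bool use_g1 = true;

static bool serial_enabled() { return use_serial; }
static bool g1_enabled() { return use_g1; }
static std::unique_ptr<ToyHeap> create_serial() {
  return std::unique_ptr<ToyHeap>(new ToyHeap{"Serial"});
}
static std::unique_ptr<ToyHeap> create_g1() {
  return std::unique_ptr<ToyHeap>(new ToyHeap{"G1"});
}

// The only place that knows about all existing GC implementations.
static const GCEntry available_gcs[] = {
  {"Serial", &serial_enabled, &create_serial},
  {"G1", &g1_enabled, &create_g1},
};

// Walk the list and instantiate the first enabled GC, if any.
inline std::unique_ptr<ToyHeap> create_heap() {
  for (const GCEntry& gc : available_gcs) {
    assert(gc.is_enabled != nullptr && gc.create != nullptr);
    if (gc.is_enabled()) return gc.create();
  }
  return nullptr;
}
```

Adding a GC then means adding one row to the table rather than another 
branch to an if-else chain.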
>>>
>>> In general, to make it easier to review/test/integrate all these
>>> changes it would be good if we can have incremental patches, each
>>> addressing some specific/contained area.
>>>
>>> cheers,
>>> Per
>