RFR: 8199417: Modularize interpreter GC barriers

Wed Apr 4 20:58:44 UTC 2018

Hi Roman,

On 2018-04-04 22:52, Roman Kennke wrote:
> Hi Erik,
>
> the changes make sense to me, and look good.

Thanks.

> I am thinking ahead a little bit and am wondering how to fit primitive
> heap access into it. Will we have to add one load_heap_X for each
> primitive type X, or call access_load_at() directly from all the places
> in the interpreter?

You would call access_load_at directly, I think, and add some backend 
where there are now asserts in BarrierSetAssembler to perform the 
primitive accesses, and stick some brooks pointers juice in your 
ShenandoahBarrierSetAssembler.

/Erik

> Thanks,
> Roman
>
>> Hi Kim,
>>
>> Thank you for reviewing this.
>>
>> I have made a new webrev with proposed changes to the x86/x64 code
>> reflecting the concerns you and Coleen have.
>> I thought we should settle for something we like in the x86/x64 code
>> before going ahead and changing the other platforms too much (I don't
>> want to bug Martin and Roman too much before).
>>
>> New full webrev:
>> http://cr.openjdk.java.net/~eosterlund/8199417/webrev.01/
>>
>> Incremental:
>> http://cr.openjdk.java.net/~eosterlund/8199417/webrev.00_01/
>>
>> On 2018-04-01 05:12, Kim Barrett wrote:
>>>> On Mar 26, 2018, at 5:31 AM, Erik Österlund
>>>> <erik.osterlund at oracle.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> The GC barriers for the interpreter are not as modular as they could
>>>> be. They currently use switch statements to check which GC barrier
>>>> set is being used, and call this or that barrier based on that, in a
>>>> way that assumes GCs only use write barriers.
>>>>
>>>> This patch modularizes this by generating accesses in the interpreter
>>>> with declarative semantics. Accesses to the heap may now use store_at
>>>> and load_at functions of the BarrierSetAssembler, passing along
>>>> appropriate arguments and decorators. Each concrete
>>>> BarrierSetAssembler can override the access completely or sprinkle
>>>> some appropriate GC barriers as necessary.
>>>>
>>>> Big thanks go to Martin Doerr and Roman Kennke, who helped plugging
>>>> this into S390, PPC and AArch64 respectively.
>>>>
>>>> Webrev:
>>>> http://cr.openjdk.java.net/~eosterlund/8199417/webrev.00/
>>>>
>>>> Bug:
>>>> https://bugs.openjdk.java.net/browse/JDK-8199417
>>>>
>>>> Thanks,
>>>> /Erik
>>> I've only looked at the x86 and generic code so far. I have some
>>> comments, but also a bunch of questions. I think I'm missing some
>>> things somewhere. I'm sending what I've got so far anyway, and will
>>> continue studying.
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp
>>> 354   // flatten object address if needed
>>> 355   // We do it regardless of precise because we need the registers
>>>
>>> There is no "precise" in this context.
>> I thought "precise" was referring to the concept of precise card
>> marking, as opposed to the variable 'precise' that used to exist. I have
>> reworded the comment to reflect that.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp
>>> 364 #ifndef _LP64
>>> 365   InterpreterMacroAssembler *imasm =
>>> static_cast<InterpreterMacroAssembler*>(masm);
>>> 366 #endif
>>>
>>> In the old code, the function received an InterpreterMacroAssembler*.
>>>
>>> What is there about this context that lets us know the downcast is
>>> safe here?  Is oop_store_at only ever called with an IMA*?  If so,
>>> then why isn't that the parameter type.  If not, then what?
>> Yes, this is indeed only used by the InterpreterMacroAssembler. But I
>> wanted the interface to use MacroAssembler. Why?
>>
>> 1) It's only 32 bit x86 that relies on this property, and I hope it will
>> go away soon, and the save bcp hack with it.
>> 2) previous load_heap_oop is used not only in the
>> InterpreterMacroAssembler, and I wanted the same assembler in load_at
>> and store_at.
>>
>> So for now it is only used in InterpreterMacroAssembler, and I doubt
>> that will change any time soon. I am hoping 32 bit support will be
>> dropped before that, and the hack will go away. For now, I have added a
>> virtual is_interpreter_masm() call that I use in an assert to make sure
>> this is not accidentally used in the wrong context until 32 bit support
>> is gone.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp
>>> 753   bs->load_at(_masm, IN_HEAP | ON_WEAK_OOP_REF, T_OBJECT,
>>> 754               rax, field_address, /*tmp1*/ rbx, /*tmp_thread*/ rdx);
>>>
>>> This needs to distinguish between WEAK and PHANTOM references, else it
>>> will break ZGC's concurrent reference processing.  (At least so long as
>>> we still have to deal with finalization.)
>> It does. This is ON_WEAK_OOP_REF. The get intrinsic is generated for
>> WeakReference (and SoftReference) only. PhantomReference does not have a
>> getter. Therefore, in all contexts this is intrinsified, it will refer
>> to an ON_WEAK_OOP_REF.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/templateInterpreterGenerator_x86.cpp
>>> 698 address
>>> TemplateInterpreterGenerator::generate_Reference_get_entry(void) {
>>>
>>> Almost all the code here (and much of the commentary) was/is specific
>>> to G1 (or SATB collectors).  Shouldn't it all be in the barrier_set's
>>> load_at?  As it is now, if (say) we're using Parallel GC then I think
>>> there's a bunch of additional code being generated and executed here
>>> that wasn't present before.
>> It is true that interpreter Reference.get() intrinsics with
>> Parallel/CMS/Serial currently run a few instructions more than they need
>> to pay for the abstraction. There is no doubt in my mind that this
>> abstraction is worth the cost. Reference.get in the interpreter is not a
>> hot path, and does not need to optimize every cycle.
>>
>> I changed the G1 comments to more general comments instead.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/methodHandles_x86.cpp
>>> 170   __ load_heap_oop(method_temp, Address(recv,
>>> NONZERO(java_lang_invoke_MethodHandle::form_offset_in_bytes())));
>>> =>
>>> 175   bs->load_at(_masm, IN_HEAP, T_OBJECT, method_temp, Address(recv,
>>> NONZERO(java_lang_invoke_MethodHandle::form_offset_in_bytes())), t
>>>
>>> This doesn't look like an improvement in code readability to me.  Why
>>> can't bs->load_at be the implementation of "__ load_heap_oop" ?
>>>
>>> And the lines are getting really long; so long that webrev is cutting
>>> off the right end, so I can't review the new versions of the lines.
>> Due to popular demand, I have done what you (and Coleen) both proposed:
>> retained the load_heap_oop abstraction.
>>
>> Before:
>> * load_heap_oop was the "raw" variant, similar to
>> oopDesc::load_decode_heap_oop, which no longer exists.
>>
>> Now:
>> * load_heap_oop calls MacroAssembler::access_load_at
>> * access_load_at grabs the active BarrierSetAssembler and calls load_at
>> on it.
>> ... same idea for stores.
>> * What used to be the raw versions for oop loads and stores, has moved
>> to BarrierSetAssembler::load_at/store_at.
>>
>> These accessors now optionally take temporary register and decorator
>> parameters than you don't have to supply:
>> * supplying temprary register is polite to the GC so its accesses can be
>> a bit better
>> * supplying decorators can be done if it is not a default access. For
>> example, AS_RAW can now be used to perform a raw load instead.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/methodHandles_x86.cpp
>>> 363         __ load_heap_oop(temp2_defc, member_clazz);
>>> =>
>>> 368         BarrierSetAssembler* bs =
>>> BarrierSet::barrier_set()->barrier_set_assembler();
>>> 369         Register rthread = LP64_ONLY(r15_thread) NOT_LP64(noreg);
>>> 370         bs->load_at(_masm, IN_HEAP, T_OBJECT, temp2_defc,
>>> member_clazz, noreg, rthread);
>>>
>>> Similarly, this doesn't seem like an improvement to me.  And there are
>>> more like this.
>> Fixed.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/macroAssembler_x86.hpp
>>>
>>> It would reduce the clutter in this change to leave the as_Address
>>> declarations where they were, since that place is now public, rather
>>> than moving them to a different public section.
>> Fixed.
>>
>>> ------------------------------------------------------------------------------
>>>
>>>
>>> Generally, it seems like it might be nice for the MacroAssembler class
>>> to have a reference to the barrier_set, or maybe even the bs's
>>> assembler, rather than looking them up in various places, and in
>>> different ways.  For example, I see each of
>>>
>>>     Universe::heap()->barrier_set()
>>>     BarrerSet::barrier_set()
>>>
>>> used in various different places *in new code*.  Some consistency
>>> would be nice.
>> I think it seems fine to just fetch them only in the
>> access_load/store_at wrappers.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/macroAssembler_x86.cpp
>>> 5249   bs->load_at(this, IN_ROOT | ON_PHANTOM_OOP_REF, T_OBJECT,
>>> 5250               value, Address(value, -JNIHandles::weak_tag_value),
>>> tmp, thread);
>>> 5251   verify_oop(value);
>>>
>>> This puts the verify_oop after the G1 pre-barrier, where in the old
>>> code verify_oop was before the pre-barrier. I remember the old order
>>> being intentional, though don't off-hand recall how important that
>>> was.  If for no other reason, it seems like the new order could rarely
>>> (due to bad timing) have a bad oop reported somewhere far away during
>>> SATB buffer processing, rather than at the point of the load.
>>>
>>> With the new version's order there could be one verify after the done
>>> label.  That's just a minor nit.
>> Conversely, I think the new order is better and more GC agnostic.
>> It avoids, for example, sending in bad oops for verification with ZGC.
>> The more GC-agnostic interface is that after the load_at
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/gc/g1/g1BarrierSetAssembler_x86.cpp
>>> 230 #ifdef _LP64
>>> 231     if (c_rarg1 != thread) {
>>> 232       __ mov(c_rarg1, thread);
>>> 233     }
>>> 234     if (c_rarg0 != pre_val) {
>>> 235       __ mov(c_rarg0, pre_val);
>>> 236     }
>>> 237 #else
>>> 238     __ push(thread);
>>> 239     __ push(pre_val);
>>> 240 #endif
>>>
>>> Old code used:
>>>
>>>     __ pass_arg1(thread);
>>>     __ pass_arg0(pre_val);
>>>
>>> Why is this being changed?  Oh, pass_argN are static helpers not
>>> available outside macroAssembler_x86.cpp.  Maybe they should be
>>> exposed in some way, like public helpers in the x86 version of the
>>> MacroAssembler class, rather than copied inline here (and maybe other
>>> places?).
>> Fixed. (added a wrapper for this to MacroAssembler, as suggested)
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/interp_masm_x86.cpp
>>> Deleted these lines:
>>> 519   // The resulting oop is null if the reference is not yet resolved.
>>> 520   // It is Universe::the_null_sentinel() if the reference resolved
>>> to NULL via condy.
>>>
>>> I don't think those comment lines should be deleted.
>> Fixed.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/gc/shared/modRefBarrierSetAssembler_x86.cpp
>>> 86 void ModRefBarrierSetAssembler::store_at(MacroAssembler* masm,
>>> DecoratorSet decorators, BasicType type,
>>> 87                                        Address dst, Register val,
>>> Register tmp1, Register tmp2) {
>>> 88   if (type == T_OBJECT || type == T_ARRAY) {
>>> 89     oop_store_at(masm, decorators, type, dst, val, tmp1, tmp2);
>>> 90   } else {
>>> 91     BarrierSetAssembler::store_at(masm, decorators, type, dst, val,
>>> tmp1, tmp2);
>>> 92   }
>>> 93 }
>>>
>>> Doesn't the conditional conversion from store_at to oop_store_at
>>> actually indicate a bug in the caller? That is, calling store_at with
>>> a oop (T_OBJECT or T_ARRAY) value seems like a bug, and should be
>>> asserted against, rather than translating.
>>>
>>> And oop_store_at should assert that the value *is* an oop value.
>>>
>>> And should there be any verification that the decorators and the type
>>> are consistent?
>>>
>>> But maybe I'm not understanding the protocol around oop_store_at?
>>> There being no documentation, it seems entirely possible that I've
>>> misunderstood something here.  Maybe I'm being confused by
>>> similarities in naming with Access:: that don't hold?
>> The confusion roots in that Access and BarrierSetAssembler are not
>> structured exactly the same.
>>
>> Access provides store_at and oop_store_at. The reason for providing both
>> of these, is that I could not safely infer whether the passed in
>> parameters were oops or not without changing oop and narrowOop to always
>> be value objects, which I did not dare to do as the generated code was
>> worse.
>>
>> In BarrierSetAssembler, there is no such restriction. The type for the
>> access is always passed in to store_at/load_at as BasicType. It is the
>> responsibility of ModRefBarrierSetAssembler to filter away non-oop
>> references, and call oop_store_at for relevant accesses for ModRef. This
>> follows the same pattern that the arraycopy stubs used. This allows,
>> e.g. CardTableModRef to only override oop_store_at. This is similar to
>> how ModRefBarrierSet::AccessBarrier<decorators,
>> BarrierSetT>::oop_store_in_heap calls the concrete barriersets for help
>> generating that access. I hope this helps understanding the structuring.
>>
>> This is also the reason why G1BarrierSetAssembler::load_at checks
>> whether it is an oop that is loaded or not, because loads are not riding
>> on the ModRef layer, which only knows about oop stores.
>>
>>> ------------------------------------------------------------------------------
>>>
>>> src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp
>>> 88 void CardTableBarrierSetAssembler::store_check(MacroAssembler*
>>> masm, Register obj, Address dst) {
>>> 89   // Does a store check for the oop in register obj. The content of
>>> 90   // register obj is destroyed afterwards.
>>>
>>> The comment states obj is an oop, but when called by the precise case
>>> of oop_store_at, it is not an oop, but rather a derived pointer.  I
>>> think it's the comment that's wrong.
>> I changed the comment to refer to obj as an oop* instead of an oop.
>>
>>> ------------------------------------------------------------------------------
>>>
>>>
>>> I was surprised that oop_store_at is initially declared at the
>>> ModRefBarrierSetAssembler level, rather than at the
>>> BarrierSetAssembler level.
>>>
>>> I was also surprised to not see an oop_load_at.
>>>
>>> In Access we have _at suffixed operations paired with non-_at suffixed
>>> operations, the difference being the _at suffixed operations have a
>>> base object, to allow dealing with the brooks-ptr. I'm not finding any
>>> such distinction in the xxxAssembler API. Presumably I've missed
>>> something somewhere. How is that handled?
>> I think I previously already explained this one. ModRef does not know
>> about oop loads, only stores. Therefore, G1 must override load_at, and
>> not oop_load_at.
>>
>> Thanks,
>> /Erik
>>
>>> ------------------------------------------------------------------------------
>>>
>>>
>>>
>