[External] : Re: premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Thu Aug 15 22:30:37 UTC 2024

On 8/15/24 3:06 PM, Vladimir Kozlov wrote:
> No.
> 
> Why you still use base+offsets for constants address? Is it for Aarch64?
> 
> What I want is something like this:
> 
> static address grain_shift_address() { return 
> &_aot_runtime_constants::_grain_shift; }
> 
> And you will not need `#if INCLUDE_CDS` if you put AOTRuntimeConstants 
> into SCCache.cpp.
> 
> I don't see the need to separate create_field_address_list() code from 
> field_addresses_list().
> 
> Add assert to field_addresses_list() to check that values are initialized.

Actually you don't need assert since we need only addresses of fields.

I don't see use of aot_runtime_constants_at() anymore.

Also why you need aot_runtime_constants()? Why not have 
_aot_runtime_constants as static field of AOTRuntimeConstants class 
which you can initialize in (again) SCCache.cpp?

Vladimir K

> 
> Vladimir K
> 
> On 8/15/24 1:43 PM, Ashutosh Mehra wrote:
>> Hi Vladimir,
>> Here is the updated patch. We have refactored the code as per your 
>> suggestions.
>> https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1 <https://urldefense.com/v3/__https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1__;!!ACWV5N9M2RV99hQ!KJr_KHIS5usVHG5WgzaX4UOvAVI4Y2Z-ZoblB2fhuGsj2i-89W0RoSddeMDGtpieLZydGC4pagdnoaYt2zbt_w$>
>>
>> Let us know if you have any more suggestions.
>>
>> Thanks,
>> - Ashutosh Mehra
>>
>>
>> On Thu, Aug 15, 2024 at 11:02 AM Vladimir Kozlov 
>> <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
>>
>>       > Ashu is proposing adding support for these latter instructions
>>     to C1 so
>>       > the barrier code does not have to pin the shift count to ECX. 
>> Do we
>>       > care? Or can we live with the effect of pinning?
>>
>>     We only care about correctness currently. Performance (registers
>>     spilling) is not issue. And I am concern about size and complexity of
>>     Leyden specific changes which eventually will be pushed into 
>> mainline.
>>     Unless you see benefits for new C1 shift instructions in mainline I
>>     would go with using ECX. And again, do this if code for using ECX is
>>     simpler than adding new instructions. If it is not - use new
>>     instructions.
>>
>>     I was hopping that "Late Barrier Expansion for G1" JEP may help here
>>     too
>>     but it has only C2 changes :(
>>
>>     I agree with updating this patch and may be even pushing it into
>>     premain
>>     (for testing) regardless "Late Barrier Expansion for G1" JEP status.
>>
>>     Thanks,
>>     Vladimir K
>>
>>
>>     On 8/15/24 3:18 AM, Andrew Dinn wrote:
>>      > On 15/08/2024 00:50, Vladimir Kozlov wrote:
>>      >> Do we really need AOTRuntimeConstants instance be a stub in
>>     CodeCache?
>>      >> Looking on this code and I think this complicate implementation
>>     if use
>>      >> ExternalAddress relocation anyway. Can it be static instance in
>>      >> SCCache.cpp which we initialize the same way as in current 
>> changes?
>>      >
>>      > Yes, you are right: we don't need the AOTRuntimeConstants struct
>>     to be
>>      > embedded in the cache if we are using an ExternalRelocation. The
>>     initial
>>      > idea for doing that was because it provides a 32-bit bound on the
>>     offset
>>      > from any struct field to a corresponding field load (lea). That
>>     would
>>      > allow us to use PC-relative loads of the field addresses -- which
>>     saves
>>      > instruction footprint on AArch64, possibly also on x86? However,
>>     we can
>>      > only do that if we also implement a new relocation (which seems
>>      > unnecessary).
>>      >
>>      > So, we will redefine the AOTRuntimeConstants struct as a C++
>>     static field.
>>      >
>>      >> I do not understand why you have concern about having separate
>>      >> relocation for each constant in AOTRuntimeConstants. We have 
>> only 2
>>      >> now. Even they grow I don't think we will have a lot of them.
>>     Actually
>>      >> the question: will it simplify code if each constant's address is
>>      >> treated as separate and not offset in AOTRuntimeConstants?
>>     Instead of
>>      >> loading AOTRuntimeConstants address and then adding offset we 
>> just
>>      >> load address of constant.
>>      >
>>      > Yes, you are right again: if we register the address of every
>>     field in
>>      > the struct in the SCC address table then that will simplify the
>>     back end
>>      > code. I was not sure how much we wanted to rely on address table
>>      > registration but given that we *are* reliant on it just now 
>> then we
>>      > might as well have the benefit.
>>      >
>>      > So, we will do the extra registration and limit the back end
>>     changes to
>>      > detect ConP occurrences in the AOTRuntimeConstants range and load
>>     them
>>      > using an ExternalAddress reloc.
>>      >
>>      >> I don't understand question about C1 (I am not familiar with
>>     it). What
>>      >> it generates for shifts now? Why you need 3-operand shift?
>>      >
>>      > I'm not sure '3-operand' is the right term to use to describe
>>     what is at
>>      > stake here. It's really about the kind of operand rather than 
>> their
>>      > number. The current region/card shift operations employ an 
>> immediate
>>      > shift count. The AOT region/card shift operations need to be 
>> fed the
>>      > shift count via a register (the target of the constant load).
>>      >
>>      > C1 does currently generate shifts where the shift count is in a
>>     register
>>      > but it only does so using the flavours of shift (sar/sal/shr/shl)
>>     that
>>      > require the register to be ECX. It doesn't currently generate the
>>      > alternative instructions (sarx/shlx/shrx) that accept a shift
>>     count in
>>      > any register. I'm not 100% sure why but I think this is because
>>     not all
>>      > CPUs implement them (they require CPUID feature BM2) or it may be
>>     to do
>>      > with them setting/not setting flags.
>>      >
>>      > Ashu is proposing adding support for these latter instructions to
>>     C1 so
>>      > the barrier code does not have to pin the shift count to ECX. 
>> Do we
>>      > care? Or can we live with the effect of pinning?
>>      >
>>      >> Code style: I noticed and aarch64.ad
>>     
>> <https://urldefense.com/v3/__http://aarch64.ad__;!!ACWV5N9M2RV99hQ!KJr_KHIS5usVHG5WgzaX4UOvAVI4Y2Z-ZoblB2fhuGsj2i-89W0RoSddeMDGtpieLZydGC4pagdnoaYhejMFCA$> still using old stile of
>>      >> instructions definition using `enc_class`. We don't do this any
>>     more
>>      >> (at least in x86*.ad) when we adding new instructions with one
>>      >> implementation. Please use `ins_encode %{ %}` in such cases.
>>      >
>>      > Sure, we will fix that.
>>      >
>>      >> I think, if you need additional register you should push/pop to
>>      >> guarantee we don't stomp over some used value.
>>      > Ok, we will play safe and do that.
>>      >
>>      > One final comment: Much of this discussion is going to be
>>     side-stepped
>>      > by the G1 late barrier implementation that is currently working
>>     its way
>>      > into mainline. That will remove the need to inject barrier IR
>>     subgraphs
>>      > into the C1 and C2 IR graphs and eliminate most (all?) of the
>>     associated
>>      > C2 reduction rules from the ad files. I think we should still
>>     update the
>>      > patch to accommodate the feedback above. We can simplify it
>>     further once
>>      > the G1 changes come in from mainline and pursue in parallel the
>>     idea of
>>      > accommodating changes to oop encodings using this model.
>>      >
>>      > regards,
>>      >
>>      >
>>      > Andrew Dinn
>>      > -----------
>>      >
>>