[External] : Re: premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Vladimir Kozlov vladimir.kozlov at oracle.com
Wed Aug 14 23:50:16 UTC 2024


Do we really need the AOTRuntimeConstants instance to be a stub in the
CodeCache? Looking at this code, I think it complicates the implementation
when we use an ExternalAddress relocation anyway. Can it be a static
instance in SCCache.cpp which we initialize the same way as in the current
changes?

I do not understand your concern about having a separate relocation for
each constant in AOTRuntimeConstants. We have only 2 now, and even if they
grow I don't think we will have a lot of them. Actually, the question is:
would it simplify the code if each constant's address were treated
separately rather than as an offset into AOTRuntimeConstants? Instead of
loading the AOTRuntimeConstants address and then adding an offset, we
would just load the address of the constant.
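The two alternatives might be modeled roughly like this (a minimal sketch;
the names, values, and layout are illustrative, not the actual HotSpot
declarations):

```cpp
#include <cassert>
#include <cstdint>

// Scheme A: one relocatable base address; AOT code loads the base and adds
// a per-constant offset before the load.
namespace base_plus_offset {
  uint32_t table[2] = {21, 9};   // e.g. grain shift, card shift
  uint32_t load(int word_offset) { return table[word_offset]; }
}

// Scheme B: one relocatable address per constant; AOT code loads each
// constant's address directly, with no offset arithmetic.
namespace per_constant {
  uint32_t grain_shift = 21;
  uint32_t card_shift  = 9;
}
```

Scheme B costs one relocation entry per constant but removes the add from
every load site; with only two constants the difference is small either way.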

I don't understand the question about C1 (I am not familiar with it). What
does it generate for shifts now? Why do you need a 3-operand shift?

Code style: I noticed aarch64.ad is still using the old style of
instruction definition via `enc_class`. We don't do this any more (at
least in x86*.ad) when adding new instructions with a single
implementation. Please use `ins_encode %{ %}` in such cases.
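For contrast, a minimal sketch of the inline form (the instruction name,
match rule, and helper calls here are illustrative, not taken from the
actual patch):

```
instruct loadCardShift(iRegINoSp dst) %{
  // hypothetical rule; the real match expression and pipeline class differ
  match(Set dst (LoadI ...));
  format %{ "ldrw  $dst, [card_shift]" %}
  ins_encode %{
    // body emitted inline here instead of referencing a shared enc_class
    __ lea(rscratch1, ExternalAddress(/* card shift address */));
    __ ldrw($dst$$Register, Address(rscratch1));
  %}
  ins_pipe(iload_reg_mem);
%}
```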

I think, if you need an additional register, you should push/pop it to
guarantee we don't stomp over some used value.

Thanks,
Vladimir

On 8/14/24 12:19 PM, Ashutosh Mehra wrote:
> Hi Vladimir,
> 
> I have done a port of Andrew D's patch for version 4 of the solution to
> x86-64.
> Here is the link to the change sets on top of Andrew's original
> patch for v4:
> https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1
> 
> Can you please review the patch? Any feedback is appreciated.
> 
> A few things worth calling out:
> 1. As Andrew D mentioned in his previous mail, C2 coalesces a pointer
> constant load followed by an add into an updated pointer constant load.
> And because C2 on x86 generates an external_word reloc for every constant
> pointer loaded, we would have to add every address in the
> AOTRuntimeConstants range to the SCAddressTable.
> To avoid this, I used the same mechanism as aarch64 to identify the
> AOTRuntimeConstants addresses in the back end.
> 
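The range test described in (1) might look roughly like this (the base,
size, and function names are illustrative, not the actual back-end code):

```cpp
#include <cassert>
#include <cstdint>
#include <cstddef>

// Stand-ins for the real AOTRuntimeConstants block address and size.
static uintptr_t aot_consts_base = 0x1000;
static size_t    aot_consts_size = 8;       // e.g. two 32-bit constants

// Instead of registering every constant address with the SCAddressTable,
// test whether a pointer constant falls anywhere inside the block...
static bool in_aot_constants_range(uintptr_t addr) {
  return addr >= aot_consts_base && addr < aot_consts_base + aot_consts_size;
}

// ...and if it does, emit a relocatable load of the base plus this offset.
static size_t aot_constants_offset(uintptr_t addr) {
  assert(in_aot_constants_range(addr));
  return addr - aot_consts_base;
}
```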
> 2. C1 doesn't support use of the 3-operand shift instructions (sarx/shlx/shrx).
> I extended C1 to lower shift ops using these instructions where possible.
> Does this make sense?
> 
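For reference, the BMI2 three-operand shifts can be modeled like this
(64-bit forms; the function names are illustrative). Unlike the classic
two-operand shl/shr/sar, which require the variable count in CL, they take
the count from any register, write a separate destination, and leave the
flags untouched; the count is masked to the operand width:

```cpp
#include <cassert>
#include <cstdint>

// shlx: logical left shift, count taken mod 64 for 64-bit operands
static uint64_t shlx64(uint64_t src, uint64_t cnt) { return src << (cnt & 63); }
// shrx: logical right shift
static uint64_t shrx64(uint64_t src, uint64_t cnt) { return src >> (cnt & 63); }
// sarx: arithmetic right shift (sign-propagating)
static int64_t  sarx64(int64_t  src, uint64_t cnt) { return src >> (cnt & 63); }
```

Not needing CL is what makes these attractive in a register-allocated
lowering: the count no longer has to be fixed into rcx.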
> 3. In G1BarrierSetAssembler::g1_write_barrier_post() the tmp2 register
> is not available for use.
> The caller G1BarrierSetAssembler::oop_store_at uses tmp2 for storing the
> uncompressed oop.
> Also, rscratch2 is not available as it is passed as tmp1, which is also
> in use.
> So we don't really have any free registers left.
> In my patch I resorted to using rscratch1. Is there a need to spill that
> register to the stack before using it?
> 
> Thanks,
> - Ashutosh Mehra
> 
> 
> On Wed, Jul 31, 2024 at 4:08 PM Vladimir Kozlov 
> <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
> 
>     Very nice. I like it.
> 
>     Thanks,
>     Vladimir K
> 
>     On 7/31/24 8:23 AM, Andrew Dinn wrote:
>      > I now have a fourth solution to the region resizing problem which
>     I think is probably the nicest looking one so far:
>      >
>      >
>      >
>     https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-codecache?expand=1
>      > Opinions welcome!
>      >
>      > This version relies on a code cache AOT data area, generated into
>     the initial stubs blob, which stores runtime constants
>      > that parameterize operation of AOT code. The AOT data area is
>     initialized with the current runtime constants immediately
>      > after universe init. AOT code loads the constants indirectly via
>     the region base address rather than embedding them into
>      > instructions. Non-AOT code can continue to use these constants as
>     immediate values.
>      >
>      > With this patch I have included the region grain size shift and
>     the card table shift in the data area. So, this patch
>      > ensures that AOT code loads both grain shift and card shift
>     counts rather than using immediates. I tested it passing
>      > -XX:GCCardSizeInBytes=128 -Xmx4M to a production run and it works
>     fine.
>      >
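A sketch of the data area described above, assuming the two shift counts
sit at word offsets 0 and 4 (field and function names are illustrative,
not the actual HotSpot declarations):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative layout of the AOT data area: runtime constants written once
// after universe init and loaded indirectly by AOT code via the base address.
struct AOTDataAreaSketch {
  uint32_t grain_shift;   // at offset 0: log2 of the G1 region grain size
  uint32_t card_shift;    // at offset 4: log2 of the GC card size
};

static AOTDataAreaSketch aot_data;

static uint32_t log2u(uint64_t v) {
  uint32_t s = 0;
  while (v > 1) { v >>= 1; s++; }
  return s;
}

// Called once, immediately after universe init, with the production values.
static void init_aot_data(uint64_t grain_bytes, uint64_t card_bytes) {
  aot_data.grain_shift = log2u(grain_bytes);
  aot_data.card_shift  = log2u(card_bytes);
}
```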
>      > The only address that needs relocating with this patch is the AOT
>     data area base. Note that it was useful to use this to
>      > implement loading of two AOT code parameters, as it highlights a
>     general problem that is not faced with only one field.
>      >
>      > I decided to register only the base address of the AOT data area
>     with the code cache as a relocatable (external) address
>      > because this is more robust than requiring registration of the
>     address of every data field embedded in the AOT data
>      > area. However, in order for the AOT cache to be able to relocate
>     loads from the data area they need to be implemented by
>      > installing the base address in a register and then adding the
>     offset before executing the load.
>      >
>      > This is fine for stub code where the generator can simply plant a
>     load effective address (lea) of a pointer constant
>      > into a register and then add in the offset. However, I know C2
>     Value optimizations will certainly coalesce a pointer
>      > constant load followed by an add to an updated pointer constant
>     load. I think this can also happen with C1 (it doesn't
>      > at the moment because in the current C1 code the grain shift load
>     -- at offset 0 -- happens inline, while the card shift
>      > load -- at offset 4 -- only happens in a callout to stub code). I
>     fixed this general problem by ensuring that the C1 and
>      > C2 back ends detect pointer constants that lie anywhere in the
>     data area address *range*. For (only) those cases the lea
>      > is implemented by generating a base address load plus an offset add.
>      >
>      > It would be nice to be able to fold the offset into the subsequent
>     load. I'll look into that as a follow-up. Either way,
>      > I think this looks like the simplest and most robust solution to
>     ensuring that AOT code is correctly parameterized at
>      > runtime. It is particularly nice that it doesn't need any new
>     relocations.
>      >
>      > regards,
>      >
>      >
>      > Andrew Dinn
>      > -----------
>      >
>      > On 24/07/2024 20:43, Vladimir Kozlov wrote:
>      >> Thank you for checking code for card_table_address.
>      >>
>      >>  > Oh, that's a nuisance. I was concerned that I was perhaps
>     using the wrong reloc model here. Should I be encoding info
>      >>  > about what/how to patch the target instruction(s) using reloc
>     data?
>      >>
>      >> I am currently thinking about adding an indication of what a relocation
>     is pointing to: blob, stubs, external address, string.
>      >> Currently we look through all our address tables in
>     SCCache to find a match. It would be nice if we could get this
>      >> information from the relocation. I also thought about using the
>     format field for it but realized it is PD and it reserves bits
>      >> in the RelocInfo word. On the other hand, we can add another 12 (31 - 18
>     - 1) new relocations without using additional bits.
>      >>
>      >> In short add new relocation if you need it.
>      >>
>      >> Thanks,
>      >> Vladimir K
>      >>
>      >> On 7/24/24 7:40 AM, Andrew Dinn wrote:
>      >>> On 23/07/2024 18:44, Vladimir Kozlov wrote:
>      >>>> I agree. As we discussed on meeting we may use it to patch
>     compressed oops/klass code too.
>      >>>
>      >>> Yes, we could potentially do something similar to adjust the
>     shift and base adjustment employed when generating a
>      >>> narrow oop or klass encode or decode. However, that would only
>     be useful within certain limits.
>      >>>
>      >>> 1) We can only cater for a change from one narrow-oop/klass
>     base+shift configuration to another base+shift
>      >>> configuration. Patching code to switch from narrow oops/klass
>     pointers to full width (or vice versa) would require
>      >>> adjusting not just the oop/klass load/store but also other
>     instructions that hard-wire header/payload field sizes and
>      >>> offsets based on the narrow/full width layout assumptions.
>     Likewise it would require resizing and repacking objects
>      >>> stored in the CDS mapped heap section.
>      >>>
>      >>> 2) Even if we restrict ourselves to the case where we retain
>     narrow oops and simply allow for a change of base and/or
>      >>> shift (i.e. keep the object layouts the same), that will only
>     work if pointers installed in mapped CDS heap objects
>      >>> are also rewritten to use the runtime encoding.
>      >>>
>      >>> 3) When generating AOT code for the encode and decode we would
>     have to allow for the worst case i.e. leave room
>      >>> (nops) for both heap-base adjustment and shift even when they
>     are not needed at generate time.
>      >>>
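The base+shift narrow-oop encoding discussed above can be modeled as
follows (the base and shift values are illustrative); patching AOT code
can retarget `base` and `shift`, but only while the narrow width, and
hence the object layout, stays the same:

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the runtime-chosen heap base and compression shift.
static uintptr_t heap_base = 0x800000000ULL;
static unsigned  oop_shift = 3;   // objects 8-byte aligned

// Encode: subtract the base and drop the alignment bits.
static uint32_t encode_oop(uintptr_t p) {
  return (uint32_t)((p - heap_base) >> oop_shift);
}

// Decode: restore the alignment bits and add the base back.
static uintptr_t decode_oop(uint32_t n) {
  return heap_base + ((uintptr_t)n << oop_shift);
}
```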
>      >>>> We have the same issue with card_table_address. I actually
>     don't know why current AOT code works??? May be because
>      >>>> it is the same value in training and production runs. But we
>     need to fix it too.
>      >>>
>      >>>> It is a constant node which is embedded into an AddP node:
>      >>>>
>     https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp#L96
>      >>>
>      >>> I believe C2 is ok. On AArch64 the C2 back end has an operand
>     called immByteMapBase that matches any use of
>      >>> byte_map_base as a pointer constant. A corresponding rule
>     ensures that any attempt to load that constant pointer is
>      >>> done by calling
>      >>>
>      >>>    __ load_byte_map_base()
>      >>>
>      >>> and the macro assembler uses an ExternalAddress when
>     StoreCachedCode is set.
>      >>>
>      >>>> I marked it only in platform specific code:
>      >>>>
>     https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L68
>      >>>
>      >>> I also think there is no mystery as to why this works on
>     AArch64. The equivalent code in
>      >>> cardTableBarrierSetAssembler_aarch64.cpp also calls
>      >>>
>      >>>    __ load_byte_map_base()
>      >>>
>      >>> However, that said, looking at the barrier card table write, it
>     seems the card table shift is another value that the
>      >>> user might decide to reconfigure between assembly and
>     production runs. The shift is derived from GCCardSizeInBytes,
>      >>> which is a GC global command line option and so susceptible to
>     being reset.
>      >>>
>      >>> We could use an aot_reloc with a different format to tag card
>     table shift lsr instructions and patch them at shared
>      >>> code load to use the current runtime GCCardSizeInBytes. Of
>     course, this is not as urgent a problem as the grain size
>      >>> shift because the card table size only ever changes thanks to
>     an explicit command line request whereas the grain size
>      >>> may be changed ergonomically when -Xmx is passed.
>      >>>
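A small model of the card write that this shift feeds into (the card size
and function names are illustrative; 0 as the dirty value follows HotSpot's
convention):

```cpp
#include <cassert>
#include <cstdint>

static uint8_t  card_table[1024];   // assume entries start clean (0xff)
static unsigned card_shift = 9;     // log2 of a 512-byte card

// The card index is the field address right-shifted by log2(card size),
// so changing GCCardSizeInBytes between runs changes the shift count that
// AOT-compiled barrier code must use.
static void dirty_card_for(uintptr_t field_addr) {
  card_table[field_addr >> card_shift] = 0;   // 0 == dirty
}
```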
>      >>>> Be careful with using `format()` in shared code because it
>     depends on platform specific setting `format_width` in
>      >>>> `relocInfo_<arch>.hpp` (32-bit arm has format_width = 0).
>      >>>
>      >>> Oh, that's a nuisance. I was concerned that I was perhaps using
>     the wrong reloc model here. Should I be encoding info
>      >>> about what/how to patch the target instruction(s) using reloc data?
>      >>>
>      >>> regards,
>      >>>
>      >>>
>      >>> Andrew Dinn
>      >>> -----------
>      >>>
>      >>
>      >
> 


More information about the leyden-dev mailing list