[External] : Re: premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Ashutosh Mehra asmehra at redhat.com
Wed Aug 14 19:19:30 UTC 2024


Hi Vladimir,

I have ported Andrew D's patch for version 4 of the solution to
x86-64.
Here is the link to my change sets on top of Andrew's original patch for
v4:
https://github.com/openjdk/leyden/compare/premain...ashu-mehra:leyden:JDK-8335440-load-via-codecache?expand=1

Can you please review the patch? Any feedback is appreciated.

A few things worth calling out:
1. As Andrew D mentioned in his previous mail, C2 coalesces a pointer
constant load followed by an add into an updated pointer constant load.
And because C2 on x86 generates an external_word reloc for every constant
pointer loaded, we would have to add every address in the
AOTRuntimeConstants range to the SCAddressTable.
To avoid this, I used the same mechanism as aarch64 to identify
AOTRuntimeConstants addresses in the back end.
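To make the range-based approach concrete, here is a minimal sketch of the idea (purely illustrative; the struct layout and names are hypothetical, not the actual AOTRuntimeConstants definition): the back end checks whether a pointer constant falls anywhere inside the constants block, instead of registering each field address separately.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch only: a contiguous block of runtime constants and a
// range check the back end can use to recognize any address inside it.
// Field names are hypothetical, not the real AOTRuntimeConstants layout.
struct ConstantsBlock {
  uint32_t grain_shift;  // G1 region grain size shift
  uint32_t card_shift;   // card table shift
};

static ConstantsBlock _constants;

// True if addr points anywhere inside the block, so the back end can emit
// a base-address load plus offset add instead of adding every field
// address to the SCAddressTable individually.
static bool in_constants_range(const void* addr) {
  const uint8_t* p    = static_cast<const uint8_t*>(addr);
  const uint8_t* base = reinterpret_cast<const uint8_t*>(&_constants);
  return p >= base && p < base + sizeof(ConstantsBlock);
}
```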

2. C1 doesn't support use of the 3-operand shift instructions (sarx/shlx/shrx).
I extended C1 to lower shift ops to these instructions when possible. Does
this make sense?
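For context, the difference C1 has to bridge can be modeled in plain C++ (illustrative helpers only, not HotSpot code): the legacy 2-operand shifts are destructive and take their count implicitly in CL, while the BMI2 3-operand forms write a separate destination and leave EFLAGS untouched.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model, not HotSpot code: legacy 2-operand shift vs the
// BMI2 3-operand form (sarx/shlx/shrx).

// Legacy 2-operand style: the result overwrites the value operand, and
// the count register is fixed (CL).
static void shl_2op(uint64_t& value, uint8_t count) {
  value <<= (count & 63);  // hardware masks the count to 6 bits
}

// 3-operand style (shlx): separate destination register, the source and
// count registers are preserved, and flags are not modified.
static uint64_t shlx_3op(uint64_t src, uint64_t count) {
  return src << (count & 63);
}
```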

3. In G1BarrierSetAssembler::g1_write_barrier_post() the tmp2 register is
not available for use: the caller G1BarrierSetAssembler::oop_store_at uses
tmp2 for storing the uncompressed oop.
rscratch2 is not available either, as it is passed as tmp1, which is also
in use.
So we don't really have any free registers left.
In my patch I resorted to using rscratch1. Is there a need to spill that
register to the stack before using it?
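If a spill does turn out to be necessary, the usual pattern is just push/use/pop around the clobber. A toy model of that save/restore discipline (purely illustrative, not MacroAssembler code):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of "spill a scratch register to the stack, clobber it, then
// restore it" -- the pattern __ push(rscratch1) ... __ pop(rscratch1)
// would produce. Purely illustrative; not HotSpot code.
struct ToyCpu {
  uint64_t rscratch1 = 0;
  std::vector<uint64_t> stack;
};

struct SpillScope {
  ToyCpu& cpu;
  explicit SpillScope(ToyCpu& c) : cpu(c) { cpu.stack.push_back(cpu.rscratch1); }
  ~SpillScope() { cpu.rscratch1 = cpu.stack.back(); cpu.stack.pop_back(); }
};
```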

Thanks,
- Ashutosh Mehra


On Wed, Jul 31, 2024 at 4:08 PM Vladimir Kozlov <vladimir.kozlov at oracle.com>
wrote:

> Very nice. I like it.
>
> Thanks,
> Vladimir K
>
> On 7/31/24 8:23 AM, Andrew Dinn wrote:
> > I now have a fourth solution to the region resizing problem which I
> think is probably the nicest looking one so far:
> >
> >
> >
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-codecache?expand=1
> > Opinions welcome!
> >
> > This version relies on a code cache AOT data area, generated into the
> initial stubs blob, which stores runtime constants
> > that parameterize operation of AOT code. The AOT data area is
> initialized with the current runtime constants immediately
> > after universe init. AOT code loads the constants indirectly via the
> data area base address rather than embedding them into
> > instructions. Non-AOT code can continue to use these constants as
> immediate values.
> >
> > With this patch I have included the region grain size shift and the card
> table shift in the data area. So, this patch
> > ensures that AOT code loads both grain shift and card shift counts
> rather than using immediates. I tested it passing
> > -XX:GCCardSizeInBytes=128 -Xmx4M to a production run and it works fine.
> >
> > The only address that needs relocating with this patch is the AOT data
> area base. Note that it was useful using this to
> > implement loading of two AOT code parameters as it highlights a general
> problem that is not faced with only one field.
> >
> > I decided to register only the base address of the AOT data area with the
> code cache as a relocatable (external) address
> > because this is more robust than requiring registration of the address
> of every data field embedded in the AOT data
> > area. However, in order for the AOT cache to be able to relocate loads
> from the data area they need to be implemented by
> > installing the base address in a register and then adding the offset
> before executing the load.
> >
> > This is fine for stub code where the generator can simply plant a load
> effective address (lea) of a pointer constant
> > into a register and then add in the offset. However, I know C2 Value
> optimizations will certainly coalesce a pointer
> > constant load followed by an add into an updated pointer constant load. I
> think this can also happen with C1 (it doesn't
> > at the moment because in the current C1 code the grain shift load -- at
> offset 0 -- happens inline, while the card shift
> > load -- at offset 4 -- only happens in a callout to stub code). I fixed
> this general problem by ensuring that the C1 and
> > C2 back ends detect pointer constants that lie anywhere in the data area
> address *range*. For (only) those cases the lea
> > is implemented by generating a base address load plus offset add.
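The base-plus-offset scheme described in the quoted paragraph can be sketched roughly as follows (an illustrative model, not the actual generated code; struct layout and values are made up): only the data area base is relocatable, and every field is reached by adding its offset before the load.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative model of the relocation scheme: only the data area base is
// registered as a relocatable (external) address; each field load is
// emitted as a base load followed by an offset add. Names and values here
// are invented for the example.
struct DataArea { uint32_t grain_shift; uint32_t card_shift; };
static DataArea g_area = { 21, 9 };  // example values only

// Models: lea reg, [base] ; add reg, offset ; load [reg]
static uint32_t load_field(uintptr_t relocated_base, size_t offset) {
  uintptr_t addr = relocated_base + offset;   // base load + offset add
  return *reinterpret_cast<uint32_t*>(addr);  // the actual load
}
```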
> >
> > It would be nice to be able to fold the offset into the subsequent load.
> I'll look into that as a follow-up. Either way,
> > I think this looks like the simplest and most robust solution to
> ensuring that AOT code is correctly parameterized at
> > runtime. It is particularly nice that it doesn't need any new
> relocations.
> >
> > regards,
> >
> >
> > Andrew Dinn
> > -----------
> >
> > On 24/07/2024 20:43, Vladimir Kozlov wrote:
> >> Thank you for checking code for card_table_address.
> >>
> >>  > Oh, that's a nuisance. I was concerned that I was perhaps using the
> wrong reloc model here. Should I be encoding info
> >>  > about what/how to patch the target instruction(s) using reloc data?
> >>
> >> I am currently thinking about adding indication what relocation is
> pointing at: blob, stubs, external address, string.
> >> Currently we are looking through all our address tables in SCCache to
> find match. Would be nice if we can get this
> >> information from relocation. And I also thought about using format for
> it but realized it is PD and it reserves bits
> >> in the RelocInfo word. On the other hand we can add another 12 (31 - 18 - 1)
> new relocations without using additional bits.
> >>
> >> In short add new relocation if you need it.
> >>
> >> Thanks,
> >> Vladimir K
> >>
> >> On 7/24/24 7:40 AM, Andrew Dinn wrote:
> >>> On 23/07/2024 18:44, Vladimir Kozlov wrote:
> >>>> I agree. As we discussed in the meeting we may use it to patch compressed
> oops/klass code too.
> >>>
> >>> Yes, we could potentially do something similar to adjust the shift and
> base adjustment employed when generating a
> >>> narrow oop or klass encode or decode. However, that would only be
> useful with certain limits.
> >>>
> >>> 1) We can only cater for a change from one narrow-oop/klass base+shift
> configuration to another base+shift
> >>> configuration. Patching code to switch from narrow oops/klass pointers
> to full width (or vice versa) would require
> >>> adjusting not just the oop/klass load/store but also other
> instructions that hard-wire header/payload field sizes and
> >>> offsets based on the narrow/full width layout assumptions. Likewise it
> would require resizing and repacking objects
> >>> stored in the CDS mapped heap section.
> >>>
> >>> 2) Even if we restrict ourselves to the case where we retain narrow
> oops and simply allow for a change of base and/or
> >>> shift (i.e. keep the object layouts the same), that will only work if
> pointers installed in mapped CDS heap objects
> >>> are also rewritten to use the runtime encoding.
> >>>
> >>> 3) When generating AOT code for the encode and decode we would have to
> allow for the worst case i.e. leave room
> >>> (nops) for both heap-base adjustment and shift even when they are not
> needed at generate time.
> >>>
> >>>> We have the same issue with card_table_address. I actually don't know
> why the current AOT code works??? Maybe because
> >>>> it is the same value in training and production runs. But we need to
> fix it too.
> >>>
> >>>> It is a constant node which is embedded into the AddP node:
> >>>>
> https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp#L96
> >>>
> >>> I believe C2 is ok. On AArch64 the C2 back end has an operand called
> immByteMapBase that matches any use of
> >>> byte_map_base as a pointer constant. A corresponding rule ensures that
> any attempt to load that constant pointer is
> >>> done by calling
> >>>
> >>>    __ load_byte_map_base()
> >>>
> >>> and the macro assembler uses an ExternalAddress when StoreCachedCode is
> set.
> >>>
> >>>> I marked it only in platform specific code:
> >>>>
> https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L68
> >>>
> >>> I also think there is no mystery as to why this works on AArch64. The
> equivalent code in
> >>> cardTableBarrierSetAssembler_aarch64.cpp also calls
> >>>
> >>>    __ load_byte_map_base()
> >>>
> >>> However, that said, looking at the barrier card table write it seems
> the card table shift is another value that the
> >>> user might decide to reconfigure between assembly and production runs.
> >>> The shift is derived from GCCardSizeInBytes,
> which is a GC global command line option and so susceptible to being reset.
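Concretely, the shift in question is just log2 of the configured card size, so a different -XX:GCCardSizeInBytes value between assembly and production runs yields a different immediate. An illustrative helper (not the HotSpot implementation):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative helper, not the HotSpot implementation: the card table
// shift is log2 of the configured card size, so changing
// -XX:GCCardSizeInBytes between training and production runs changes the
// shift an AOT barrier would otherwise bake in as an immediate.
static int card_shift_for(uint32_t card_size_in_bytes) {
  int shift = 0;
  while ((1u << shift) < card_size_in_bytes) shift++;
  assert((1u << shift) == card_size_in_bytes);  // must be a power of two
  return shift;
}
```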
> >>>
> >>> We could use an aot_reloc with a different format to tag card table
> shift lsr instructions and patch them at shared
> code load to use the current runtime GCCardSizeInBytes. Of course, this
> is not as urgent a problem as the grain size
> >>> shift because the card table size only ever changes thanks to an
> explicit command line request whereas the grain size
> >>> may be changed ergonomically when -Xmx is passed.
> >>>
> >>>> Be careful with using `format()` in shared code because it depends on
> platform specific setting `format_width` in
> >>>> `relocInfo_<arch>.hpp` (32-bit arm has format_width = 0).
> >>>
> >>> Oh, that's a nuisance. I was concerned that I was perhaps using the
> wrong reloc model here. Should I be encoding info
> >>> about what/how to patch the target instruction(s) using reloc data?
> >>>
> >>> regards,
> >>>
> >>>
> >>> Andrew Dinn
> >>> -----------
> >>>
> >>
> >
>
>

