[External] : Re: premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)
Andrew Dinn
adinn at redhat.com
Wed Jul 31 15:23:11 UTC 2024
I now have a fourth solution to the region resizing problem which I
think is probably the nicest looking one so far:
https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-codecache?expand=1
Opinions welcome!
This version relies on a code cache AOT data area, generated into the
initial stubs blob, which stores runtime constants that parameterize
the operation of AOT code. The AOT data area is initialized with the current
runtime constants immediately after universe init. AOT code loads the
constants indirectly via the data area base address rather than embedding
them into instructions. Non-AOT code can continue to use these constants
as immediate values.
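To make the scheme concrete, here is a minimal sketch of what the data
area amounts to (all names below are hypothetical, chosen for
illustration rather than taken from the patch): a small table of words
in the code cache that the runtime fills in at startup and that AOT
code reads through a single base pointer.

    // Hypothetical sketch of the data area layout -- not the patch's
    // actual declarations. The table lives in the initial stubs blob.
    struct AOTDataArea {
      uint32_t grain_shift; // G1 region grain size shift, set after universe init
      uint32_t card_shift;  // card table shift, set after universe init
    };

    // Runtime side: populate the table once the real values are known.
    // The two accessors are stand-ins for however HotSpot exposes the
    // current grain and card shifts.
    void aot_data_area_init(AOTDataArea* area) {
      area->grain_shift = current_grain_shift(); // assumed accessor
      area->card_shift  = current_card_shift();  // assumed accessor
    }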
With this patch I have included the region grain size shift and the card
table shift in the data area. So, this patch ensures that AOT code loads
both the grain shift and card shift counts rather than using immediates. I
tested it by passing -XX:GCCardSizeInBytes=128 -Xmx4M to a production run
and it works fine.
The only address that needs relocating with this patch is the AOT data
area base. Note that it was useful to implement loading of two AOT code
parameters here, as it highlights a general problem that does not arise
with only one field.
I decided to register only the base address of the AOT data area with the
code cache as a relocatable (external) address because this is more
robust than requiring registration of the address of every data field
embedded in the AOT data area. However, in order for the AOT cache to be
able to relocate loads from the data area they need to be implemented by
installing the base address in a register and then adding the offset
before executing the load.
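For example, on AArch64 a stub-side load of the grain shift might look
something like the following sketch (the register choices, helper names
and offsets here are mine, not the patch's):

    // Materialize the (relocatable) data area base, add the field
    // offset explicitly, then load. Only the lea target needs relocating.
    __ lea(rscratch1, ExternalAddress(aot_data_area_base));  // relocated base
    __ add(rscratch1, rscratch1, grain_shift_offset);        // explicit offset add
    __ ldrw(rscratch2, Address(rscratch1));                  // load the shift count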
This is fine for stub code where the generator can simply plant a load
effective address (lea) of a pointer constant into a register and then
add in the offset. However, I know C2 Value optimizations will certainly
coalesce a pointer constant load followed by an add into a single updated
pointer constant load. I think this can also happen with C1 (it doesn't
at the moment because in the current C1 code the grain shift load -- at
offset 0 -- happens inline, while the card shift load -- at offset 4 --
only happens in a callout to stub code). I fixed this general problem by
ensuring that the C1 and C2 back ends detect pointer constants that lie
anywhere in the data area address *range*. For (only) those cases the
lea is implemented by generating a base address load plus an offset add.
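The detection amounts to a simple range test against the data area
bounds, along these lines (again a hypothetical sketch with assumed
accessor names):

    // Treat any pointer constant that falls inside the data area as a
    // candidate for the base-load-plus-offset-add form of lea.
    static bool is_aot_data_area_address(address addr) {
      address base = aot_data_area_base();  // assumed accessor
      return addr >= base && addr < base + aot_data_area_size();
    }

    // In the back end's lea path:
    //   if (is_aot_data_area_address(target)) emit base load + offset add
    //   else                                  emit the usual constant lea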
It would be nice to be able to fold the offset into the subsequent load.
I'll look into that as a follow-up. Either way, I think this looks like
the simplest and most robust solution to ensuring that AOT code is
correctly parameterized at runtime. It is particularly nice that it
doesn't need any new relocations.
regards,
Andrew Dinn
-----------
On 24/07/2024 20:43, Vladimir Kozlov wrote:
> Thank you for checking code for card_table_address.
>
> > Oh, that's a nuisance. I was concerned that I was perhaps using the
> > wrong reloc model here. Should I be encoding info
> > about what/how to patch the target instruction(s) using reloc data?
>
> I am currently thinking about adding an indication of what a relocation
> points to: blob, stubs, external address, string.
> Currently we are looking through all our address tables in SCCache to
> find a match. It would be nice if we could get this information from the
> relocation. I also thought about using the format for it but realized it
> is platform-dependent (PD) and reserves bits in the RelocInfo word. On
> the other hand, we can add another 12 (31 - 18 - 1) new relocations
> without using additional bits.
>
> In short, add a new relocation if you need it.
>
> Thanks,
> Vladimir K
>
> On 7/24/24 7:40 AM, Andrew Dinn wrote:
>> On 23/07/2024 18:44, Vladimir Kozlov wrote:
>>> I agree. As we discussed in the meeting, we may use it to patch
>>> compressed oops/klass code too.
>>
>> Yes, we could potentially do something similar to adjust the base
>> adjustment and shift employed when generating a narrow oop or klass
>> encode or decode. However, that would only be useful within certain
>> limits.
>>
>> 1) We can only cater for a change from one narrow-oop/klass base+shift
>> configuration to another base+shift configuration. Patching code to
>> switch from narrow oops/klass pointers to full width (or vice versa)
>> would require adjusting not just the oop/klass load/store but also
>> other instructions that hard-wire header/payload field sizes and
>> offsets based on the narrow/full width layout assumptions. Likewise it
>> would require resizing and repacking objects stored in the CDS mapped
>> heap section.
>>
>> 2) Even if we restrict ourselves to the case where we retain narrow
>> oops and simply allow for a change of base and/or shift (i.e. keep the
>> object layouts the same), that will only work if pointers installed in
>> mapped CDS heap objects are also rewritten to use the runtime encoding.
>>
>> 3) When generating AOT code for the encode and decode we would have to
>> allow for the worst case, i.e. leave room (nops) for both the heap-base
>> adjustment and the shift even when they are not needed at generate time.
>>
>>> We have the same issue with card_table_address. I actually don't know
>>> why the current AOT code works??? Maybe because it is the same value in
>>> training and production runs. But we need to fix it too.
>>
>>> It is a constant node which is embedded into an AddP node:
>>> https://github.com/openjdk/leyden/blob/premain/src/hotspot/share/gc/shared/c2/cardTableBarrierSetC2.cpp#L96
>>
>> I believe C2 is ok. On AArch64 the C2 back end has an operand called
>> immByteMapBase that matches any use of byte_map_base as a pointer
>> constant. A corresponding rule ensures that any attempt to load that
>> constant pointer is done by calling
>>
>> __ load_byte_map_base()
>>
>> and the macro assembler uses an ExternalAddress when StoreCachedCode is
>> set.
>>
>>> I marked it only in platform-specific code:
>>> https://github.com/openjdk/leyden/blob/premain/src/hotspot/cpu/x86/gc/shared/cardTableBarrierSetAssembler_x86.cpp#L68
>>
>> I also think there is no mystery as to why this works on AArch64. The
>> equivalent code in cardTableBarrierSetAssembler_aarch64.cpp also calls
>>
>> __ load_byte_map_base()
>>
>> However, that said, looking at the barrier card table write it seems
>> the card table shift is another value that the user might decide to
>> reconfigure between assembly and production runs. The shift is derived
>> from GCCardSizeInBytes, which is a GC global command line option and so
>> susceptible to being reset.
>>
>> We could use an aot_reloc with a different format to tag card table
>> shift lsr instructions and patch them at shared code load to use the
>> current runtime GCCardSizeInBytes. Of course, this is not as urgent a
>> problem as the grain size shift because the card table size only ever
>> changes thanks to an explicit command line request whereas the grain
>> size may be changed ergonomically when -Xmx is passed.
>>
>>> Be careful with using `format()` in shared code because it depends on
>>> the platform-specific setting `format_width` in `relocInfo_<arch>.hpp`
>>> (32-bit arm has format_width = 0).
>>
>> Oh, that's a nuisance. I was concerned that I was perhaps using the
>> wrong reloc model here. Should I be encoding info about what/how to
>> patch the target instruction(s) using reloc data?
>>
>> regards,
>>
>>
>> Andrew Dinn
>> -----------
>>
>
--
regards,
Andrew Dinn
-----------
Red Hat Distinguished Engineer
Red Hat UK Ltd
Registered in England and Wales under Company Registration No. 03798903
Directors: Michael Cunningham, Michael ("Mike") O'Neill