premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Wed Jul 17 10:15:04 UTC 2024

Hi Ioi,

On 16/07/2024 17:33, ioi.lam at oracle.com wrote:
> 
> On 7/15/24 9:23 AM, Andrew Dinn wrote:
>> . . .
>> The second solution modifies barrier generation when the SCCache is 
>> open for writing to load the shift count from a runtime field, 
>> G1HeapRegion::LogHRGrainSize i.e. the same field that determines the 
>> immediate count used for normal generation. In order to make this 
>> visible to the compilers and SCC address table the address of this 
>> field is exported via the card table. This solution requires the AOT 
>> code to reference the target address using a runtime address 
>> relocation. Once again, if the SCCache is not open for writing the 
>> count is generated as normal i.e. as an immediate operand.
>>
>>
> Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?
> 
>      ldr grain, [pc, #5678]

That's not what this option does. The barrier loads the grain size 
indirectly via a constant static field address, i.e. via address 
&G1HeapRegion::LogHRGrainSize (well, actually, the constant is 
determined by whatever address is reported by the barrier card table but 
effectively it is &G1HeapRegion::LogHRGrainSize). So the barrier 
uses a sequence like this

   movz reg, #<16bit>
   movk reg, #<16bit>, lsl #16
   movk reg, #<16bit>, lsl #32
   ldrb reg, [reg]
   . . .
   lsr reg2, reg2, reg

The 16 bit quantities compose to the address of the field. The 3 x mov 
sequence is marked with a runtime relocation which ensures that it is 
updated when generated code is restored from the SCCache. That requires 
the field address to be inserted in the SCC address table's list of 
external addresses.
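
As a rough illustration of how the three 16-bit immediates compose the 
field address, and hence what the runtime relocation has to rewrite, 
here is a minimal C++ model. It is not HotSpot code: MovChunks, 
split_address and compose_address are hypothetical names, and it assumes 
the address fits in 48 bits, as the 3 x mov sequence implies.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of the movz/movk/movk immediates; not HotSpot code.
// Chunk i holds address bits [16*i, 16*i + 15].
using MovChunks = std::array<uint16_t, 3>;

// Split an address (assumed to fit in 48 bits) into the three immediates
// that a relocation would write into the movz/movk/movk instructions.
inline MovChunks split_address(uint64_t addr) {
  return { static_cast<uint16_t>(addr & 0xffff),
           static_cast<uint16_t>((addr >> 16) & 0xffff),
           static_cast<uint16_t>((addr >> 32) & 0xffff) };
}

// Recompose the value that movz; movk lsl #16; movk lsl #32 leaves in reg.
inline uint64_t compose_address(const MovChunks& c) {
  return static_cast<uint64_t>(c[0]) |
         (static_cast<uint64_t>(c[1]) << 16) |
         (static_cast<uint64_t>(c[2]) << 32);
}
```

Restoring code from the SCCache then amounts to calling something like 
split_address on the current runtime's field address and rewriting the 
three immediates in place.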

This scheme requires repeating that series of 3 x mov + ldrb 
instructions at every object field store in a given compiled method. 
That also implies a runtime relocation for each such sequence when the 
code is restored from the SCCache.

With C2 the barrier manifests as a (Set dst con) for a special ConP 
value (operand con has type immRegionGrainShift) feeding a LoadB. I 
guess C2 might conceivably be able to optimize away some of the repeat 
movz/k and ldrb sequences if it is able to keep the address or byte 
value in a register or spill slot but I would not expect that to be likely.

> I suppose this require us to put multiple copies of 
> G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit for 
> the offset. But we will be patching fewer places than every sites that 
> needs to know the grain size.
I think what you are suggesting here is what I described as option 4, 
i.e. we put the grain size in the nmethod const section (or in a 
dedicated data location for a non-nmethod blob) and insert a pc-relative 
load in the barrier to feed the lsr.

With AOT code this would require a special relocation to mark the 
constants area slot (or the non-nmethod blob data slot), let's call it 
reloc_grain_shift_const. It would patch the constant to whatever value 
field G1HeapRegion::LogHRGrainSize has in the current runtime (or rather 
to whatever grain size is reported by the barrier card table). We don't 
have such a reloc at present. We do have an existing reloc for a 
runtime data address which is why I implemented option 2 first (to work 
out where I would need to tweak the compilers and barrier set assemblers 
plus auxiliary classes).

With option 4 I believe we will only need one occurrence of the 
constant. On AArch64 we would use either adr or adrp + add to install a 
pc-relative address into a register and then an ldrb via that register.

   adr reg, #<21bits>
   ldrb reg, [reg]
   ...
   lsr reg2, reg2, reg

or

   adrp reg, #<21bits> # selects 4KB-aligned page (low 12 bits zero)
   add  reg, reg, #<12bits>
   ldrb reg, [reg]
   ...
   lsr reg2, reg2, reg

The adr/adrp instructions do not need relocating which is why option 4 
would only require 1 relocation per nmethod (or non-nmethod blob).
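
To make the choice between the two forms concrete, here is a small C++ 
sketch of the reach check. It relies only on the architectural ranges 
(adr takes a signed 21-bit byte offset, so roughly +/-1MB; adrp takes a 
signed 21-bit offset in 4KB pages, so roughly +/-4GB). PcRelForm and 
choose_form are hypothetical names, not anything in HotSpot.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not HotSpot code: pick the pc-relative form that
// can reach the constant slot from the barrier.
enum class PcRelForm { Adr, AdrpAdd, OutOfRange };

constexpr int64_t kSigned21Limit = int64_t{1} << 20;  // range of a signed 21-bit value

inline PcRelForm choose_form(uint64_t pc, uint64_t target) {
  // adr: signed 21-bit byte offset from the instruction address.
  int64_t delta = static_cast<int64_t>(target - pc);
  if (delta >= -kSigned21Limit && delta < kSigned21Limit) {
    return PcRelForm::Adr;
  }
  // adrp: signed 21-bit offset counted in 4KB pages; the following add
  // supplies the low 12 bits of the target (target & 0xfff).
  int64_t page_delta = static_cast<int64_t>((target >> 12) - (pc >> 12));
  if (page_delta >= -kSigned21Limit && page_delta < kSigned21Limit) {
    return PcRelForm::AdrpAdd;
  }
  return PcRelForm::OutOfRange;
}
```

Either way the instructions themselves only encode a pc-relative 
distance, so just the constant slot needs patching.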

Option 3 involves generating the normal barrier

     lsr reg, reg, #imm

The difference is that for AOT code we would mark the instruction with a 
new relocation, let's call it reloc_grain_shift_immediate. Patching for 
this reloc would assert that the corresponding instruction is a shift 
and that the current GC barrier set is using a card table. It would 
update the immediate operand with whatever grain size shift was reported 
by the card table.
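
A minimal sketch of what the patching step might look like at the 
instruction-encoding level follows. It uses the fact that on AArch64 
lsr Xd, Xn, #shift is an alias of ubfm Xd, Xn, #shift, #63, so the shift 
lives in the immr field (bits 21:16). The names is_lsr_imm64 and 
patch_grain_shift are hypothetical; no such reloc exists yet, and the 
card table check is omitted here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch, not HotSpot code.
// 64-bit UBFM: sf=1, opc=10, 100110, N=1 in bits 31:22.
constexpr uint32_t kUbfm64Mask = 0xffc00000u;
constexpr uint32_t kUbfm64Bits = 0xd3400000u;
constexpr uint32_t kImmrMask   = 0x3fu << 16;  // shift amount for the lsr alias

// True if insn is a 64-bit "lsr Xd, Xn, #shift" (ubfm with imms == 63).
inline bool is_lsr_imm64(uint32_t insn) {
  return (insn & kUbfm64Mask) == kUbfm64Bits && ((insn >> 10) & 0x3f) == 63;
}

// Rewrite the immediate shift, leaving the registers untouched.
inline uint32_t patch_grain_shift(uint32_t insn, unsigned shift) {
  assert(is_lsr_imm64(insn));  // the reloc must point at a shift
  assert(shift < 64);
  return (insn & ~kImmrMask) | (static_cast<uint32_t>(shift) << 16);
}
```

For example, patching 0xd343fc20 (lsr x0, x1, #3) with a shift of 21 
yields 0xd355fc20 (lsr x0, x1, #21).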

Like option 2 this would require a reloc for every object field write in 
an nmethod (or non-nmethod blob).
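
The relocation cost of the three options can be summarised with a tiny 
C++ model. The enum labels and relocs_per_blob are hypothetical, and 
the model assumes option 4 needs no constant slot at all when a blob 
contains no object field writes.

```cpp
#include <cassert>

// Hypothetical labels for the options discussed in this thread.
enum class Option { RuntimeFieldLoad = 2, PatchedImmediate = 3, PcRelativeConstant = 4 };

// Relocations processed when restoring one nmethod (or non-nmethod blob)
// containing `field_stores` object field write barriers.
inline unsigned relocs_per_blob(Option opt, unsigned field_stores) {
  switch (opt) {
    case Option::RuntimeFieldLoad:    // one relocated 3 x mov sequence per barrier
    case Option::PatchedImmediate:    // one relocated lsr per barrier
      return field_stores;
    case Option::PcRelativeConstant:  // assumption: one shared constant slot
      return field_stores > 0 ? 1u : 0u;
  }
  return 0;  // not reached for valid Option values
}
```

So a method with, say, 12 field stores would need 12 relocations under 
options 2 and 3 but only 1 under option 4.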

regards,

Andrew Dinn
-----------