premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Wed Jul 17 17:27:00 UTC 2024

 > We don't have such a reloc at present..

What about section_word_Relocation, so we can put the grain value into the constants section?

Thanks,
Vladimir K

On 7/17/24 3:15 AM, Andrew Dinn wrote:
> Hi Ioi,
> 
> On 16/07/2024 17:33, ioi.lam at oracle.com wrote:
>>
>> On 7/15/24 9:23 AM, Andrew Dinn wrote:
>>> . . .
>>> The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a 
>>> runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal 
>>> generation. In order to make this visible to the compilers and SCC address table the address of this field is 
>>> exported via the card table. This solution requires the AOT code to reference the target address using a runtime 
>>> address relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an 
>>> immediate operand.
>>>
>>>
>> Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?
>>
>>      ldr grain, [pc, #5678]
> 
> That's not what this option does. The barrier loads the grain size indirectly via a constant static field address, i.e. 
> via address &G1HeapRegion::LogHRGrainSize (well, actually, the constant is determined by whatever address is reported by 
> the barrier card table but effectively it is &G1HeapRegion::LogHRGrainSize). So the barrier uses a sequence 
> like this
> 
>    movz reg, #<16bit>
>    movk reg, #<16bit>, lsl #16
>    movk reg, #<16bit>, lsl #32
>    ldrb reg, [reg]
>    . . .
>    lsr reg2, reg, reg2
> 
> The 16 bit quantities compose to the address of the field. The 3 x mov sequence is marked with a runtime relocation 
> which ensures that it is updated when generated code is restored from the SCCache. That requires the field address to be 
> inserted in the SCC address table's list of external addresses.
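
To illustrate how the three 16-bit immediates compose to the field address, here is a minimal C++ sketch. It assumes a 48-bit address space, and the helper names split_address and compose_address are hypothetical, not HotSpot functions. A runtime relocation would in effect recompute the three chunks for the field's address in the current process and rewrite the immediates.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: split a 48-bit address into the three 16-bit
// immediates carried by movz (bits 0-15), movk lsl #16 (bits 16-31)
// and movk lsl #32 (bits 32-47).
std::array<uint16_t, 3> split_address(uint64_t addr) {
  return { static_cast<uint16_t>(addr & 0xFFFF),
           static_cast<uint16_t>((addr >> 16) & 0xFFFF),
           static_cast<uint16_t>((addr >> 32) & 0xFFFF) };
}

// Hypothetical helper: rebuild the address the way the movz/movk
// sequence does at run time.
uint64_t compose_address(const std::array<uint16_t, 3>& chunks) {
  return static_cast<uint64_t>(chunks[0])
       | (static_cast<uint64_t>(chunks[1]) << 16)
       | (static_cast<uint64_t>(chunks[2]) << 32);
}
```
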
> 
> This scheme requires repeating that series of 3 x mov + ldrb instructions at every object field store in a given 
> compiled method. That also implies a runtime relocation for each such sequence when the code is restored from the SCCache.
> 
> With C2 the barrier manifests as a (Set dst con) for a special ConP value (operand con has type immRegionGrainShift) 
> feeding a LoadB. I guess C2 might conceivably be able to optimize away some of the repeat movz/k and ldrb sequences if 
> it is able to keep the address or byte value in a register or spill slot but I would not expect that to be likely.
> 
>> I suppose this requires us to put multiple copies of G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a 
>> limit for the offset. But we will be patching fewer places than every site that needs to know the grain size.
> I think what you are suggesting here is what I described as option 4. i.e. we put the grain size in the nmethod const 
> section (or in a dedicated data location for a non-nmethod blob) and insert a pc-relative load in the barrier to feed 
> the lsr.
> 
> With AOT code this would require a special relocation to mark the constants area slot (or the non-nmethod blob data 
> slot), let's call it reloc_grain_shift_const. It would patch the constant to whatever value field 
> G1HeapRegion::LogHRGrainSize has in the current runtime (or rather to whatever grain size is reported by the barrier 
> card table). We don't have such a reloc at present. We do have an existing reloc for a runtime data address which is 
> why I implemented option 2 first (to work out where I would need to tweak the compilers and barrier set assemblers plus 
> auxiliary classes).
> 
> With option 4 I believe we will only need one occurrence of the constant. On AArch64 we would use either adr or adrp + 
> add to install a pc-relative address into a register and then an ldrb via that register.
> 
>    adr reg, #<21bits>
>    ldrb reg, [reg]
>    ...
>    lsr reg2, reg, reg2
> 
> or
> 
>    adrp reg, #<21bits> # selects 4KB-aligned (12-bit) page
>    add  reg, reg, #<12bits>
>    ldrb reg, [reg]
>    ...
>    lsr reg2, reg, reg2
> 
> The adr/adrp instructions do not need relocating which is why scheme 4 would only require 1 relocation per nmethod (or 
> non-nmethod blob).
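
As a rough illustration of why the pc-relative forms need no patching, the C++ sketch below computes the offsets that adr and adrp + add would encode. The names fits_adr, split_for_adrp, resolve_adrp_add and the AdrpAdd struct are hypothetical and not taken from HotSpot. The encoded values depend only on the distance between the instruction and the constant, so they stay valid wherever the code is loaded, provided the code and its constant move together.

```cpp
#include <cstdint>

// Hypothetical pair of values encoded by an adrp + add sequence.
struct AdrpAdd {
  int64_t  page_delta;  // signed 21-bit count of 4KB pages (adrp)
  uint32_t low12;       // offset within the target page (add)
};

// adr encodes a signed 21-bit byte offset, i.e. a reach of +/-1MB.
bool fits_adr(uint64_t pc, uint64_t target) {
  int64_t delta = static_cast<int64_t>(target - pc);
  return delta >= -(int64_t{1} << 20) && delta < (int64_t{1} << 20);
}

// Split a target into the page delta and in-page offset for adrp + add.
AdrpAdd split_for_adrp(uint64_t pc, uint64_t target) {
  int64_t page_delta = static_cast<int64_t>((target >> 12) - (pc >> 12));
  return { page_delta, static_cast<uint32_t>(target & 0xFFF) };
}

// Recompute the address the way the hardware does: page of pc, plus the
// page delta scaled by 4KB, plus the low 12 bits.
uint64_t resolve_adrp_add(uint64_t pc, AdrpAdd parts) {
  uint64_t page = pc & ~uint64_t{0xFFF};
  return page + (static_cast<uint64_t>(parts.page_delta) << 12) + parts.low12;
}
```
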
> 
> Option 3 involves generating the normal barrier
> 
>      lsr reg, reg, #imm
> 
> The difference is that for AOT code we would mark the instruction with a new relocation, let's call it 
> reloc_grain_shift_immediate. Patching for this reloc would assert that the corresponding instruction is a shift and 
> that the current GC barrier set is using a card table. It would update the immediate operand with whatever grain size 
> shift was reported by the card table.
> 
> Like scheme 2 this would require a reloc for every object field write in an nmethod (or non-nmethod blob).
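
A minimal C++ sketch of the kind of patching option 3 implies, assuming the barrier uses the 64-bit immediate form of lsr (an alias of UBFM with imms = 63, so the shift count sits in the immr field, bits 16-21). The names is_lsr_imm64 and patch_grain_shift are hypothetical. A real relocation would also check that the barrier set uses a card table, as described above, and would obtain the shift from it.

```cpp
#include <cassert>
#include <cstdint>

// Fixed bits of "lsr Xd, Xn, #shift": sf=1, opc=10, 100110, N=1, imms=63.
constexpr uint32_t kLsrImmMask    = 0xFFC0FC00;
constexpr uint32_t kLsrImmPattern = 0xD340FC00;

// True if insn is a 64-bit lsr with an immediate shift count.
bool is_lsr_imm64(uint32_t insn) {
  return (insn & kLsrImmMask) == kLsrImmPattern;
}

// Hypothetical patch step: replace the shift count (immr, bits 16-21)
// with the grain shift reported at run time, leaving registers intact.
uint32_t patch_grain_shift(uint32_t insn, unsigned shift) {
  assert(is_lsr_imm64(insn));
  assert(shift < 64);
  return (insn & ~(0x3Fu << 16)) | (shift << 16);
}
```

For example, 0xD343FC20 encodes lsr x0, x1, #3, and patching it with a shift of 21 yields 0xD355FC20, i.e. lsr x0, x1, #21.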
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
>