[External] : Re: premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Jul 18 16:15:51 UTC 2024


On 7/18/24 4:00 AM, Andrew Dinn wrote:
> On 17/07/2024 18:27, Vladimir Kozlov wrote:
>>  > We don't have such a reloc at present.
>>
>> What about section_word_Relocation so we can put grain value into constants section?
> 
> I agree that when compiling an nmethod we would need to use a section_word_type reloc to mark the adrp that accesses the 
> constant. That would ensure that the offset used by the adrp is kept consistent across buffer resizes and at install 
> when the displacement may change.
> 
> However, what I was talking about was a new reloc, needed only when the SCCache restores code, which would mark the 
> constant itself. When AOT code is restored we need to ensure any such constant is rewritten using the runtime grain size.
> 
> We could attempt to do the rewrite of the constant as a side-effect of processing the section_word_type reloc during 
> code restore. However, we would need to know for sure that the constant being accessed by the adrp was definitely the 
> grain size. Is that what you were thinking of, Vladimir?
> 
> Of course that would not work for stubs which need to include a barrier and a reference to the barrier shift (I believe 
> this only applies for some of the memory copy stubs). In this case we would have to load the constant from a data slot 
> allocated in amongst the instructions. So, I think we would not be able to identify the location of the constant with a 
> section_word_type reloc.

Yes, you are right, section_word_type will not work.

What about allocating a word in the CodeCache, as we do for some intrinsic stub tables? You will need to generate it only 
once and can use a runtime_type relocation to access it.
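[Roughly what that would look like, as a plain standalone C++ sketch (names invented for illustration, not HotSpot code): one word generated once, filled at startup from the runtime's grain shift; every restored barrier loads through it, so only references to the one slot need the runtime_type relocation.]

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch: a single word allocated once (standing in for a slot
// carved out of the code cache), holding the runtime grain shift. All AOT
// barriers would load through a relocated pointer to this one slot, so only
// the slot's address needs patching on restore, not every barrier site.
struct GrainShiftSlot {
  static uint8_t* address() {
    static uint8_t slot = 0;  // stand-in for the code cache word
    return &slot;
  }
  // Filled once at startup from the runtime's G1HeapRegion::LogHRGrainSize.
  static void initialize(uint8_t runtime_log_grain_size) {
    *address() = runtime_log_grain_size;
  }
};

// What a restored barrier effectively does: load the byte, then shift the
// xor of store address and new value to test for a cross-region store.
uint64_t cross_region_check(uint64_t store_addr, uint64_t new_val) {
  uint8_t shift = *GrainShiftSlot::address();
  return (store_addr ^ new_val) >> shift;
}
```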

It all comes down to loading via an existing relocation vs. a specialized relocation for an immediate value (option three).
I would like to see how complex option three is.

Thanks,
Vladimir K

> 
> regards,
> 
> 
> Andrew Dinn
> -----------
> 
>> On 7/17/24 3:15 AM, Andrew Dinn wrote:
>>> Hi Ioi,
>>>
>>> On 16/07/2024 17:33, ioi.lam at oracle.com wrote:
>>>>
>>>> On 7/15/24 9:23 AM, Andrew Dinn wrote:
>>>>> . . .
>>>>> The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a 
>>>>> runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal 
>>>>> generation. In order to make this visible to the compilers and SCC address table the address of this field is 
>>>>> exported via the card table. This solution requires the AOT code to reference the target address using a runtime 
>>>>> address relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an 
>>>>> immediate operand.
>>>>>
>>>>>
>>>> Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?
>>>>
>>>>      ldr grain, [pc, #5678]
>>>
>>> That's not what this option does. The barrier loads the grain size indirectly via a constant static field address, 
>>> i.e. via address &G1HeapRegion::LogHRGrainSize (well, actually, the constant is determined by whatever address is 
>>> reported by the barrier card table, but effectively it is &G1HeapRegion::LogHRGrainSize). So the barrier uses 
>>> a sequence like this
>>>
>>>    movz reg, #<16bit>
>>>    movk reg, #<16bit>, lsl #16
>>>    movk reg, #<16bit>, lsl #32
>>>    ldrb reg, [reg]
>>>    . . .
>>>    lsr reg2, reg, reg2
>>>
>>> The 16-bit quantities compose into the address of the field. The 3 x mov sequence is marked with a runtime relocation 
>>> which ensures that it is updated when generated code is restored from the SCCache. That requires the field address to 
>>> be inserted in the SCC address table's list of external addresses.
>>>
>>> This scheme requires repeating that series of 3 x mov + ldrb instructions at every object field store in a given 
>>> compiled method. That also implies a runtime relocation for each such sequence when the code is restored from the 
>>> SCCache.
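[To make the patching concrete: a standalone C++ sketch of how the imm16 fields of the 3 x mov sequence above compose an address, and what the runtime relocation would rewrite on restore. The encoding helpers are hand-rolled for illustration; they are not HotSpot's MacroAssembler API.]

```cpp
#include <cassert>
#include <cstdint>

// AArch64 MOVZ/MOVK field layout:
//   sf(31) opc(30:29) 100101(28:23) hw(22:21) imm16(20:5) Rd(4:0)
// 64-bit MOVZ has opc=10 (base 0xD2800000), MOVK has opc=11 (base 0xF2800000).
static uint32_t movz(unsigned rd, uint16_t imm16, unsigned hw) {
  return 0x80000000u | (0x2u << 29) | (0x25u << 23) | (hw << 21)
       | ((uint32_t)imm16 << 5) | rd;
}
static uint32_t movk(unsigned rd, uint16_t imm16, unsigned hw) {
  return 0x80000000u | (0x3u << 29) | (0x25u << 23) | (hw << 21)
       | ((uint32_t)imm16 << 5) | rd;
}

// What the runtime-address relocation must do on restore: rewrite the three
// imm16 fields so the sequence composes the field address valid in this JVM.
static void patch_mov_sequence(uint32_t insns[3], uint64_t addr) {
  for (int i = 0; i < 3; i++) {
    uint16_t chunk = (uint16_t)(addr >> (16 * i));
    insns[i] = (insns[i] & ~(0xFFFFu << 5)) | ((uint32_t)chunk << 5);
  }
}

// Decode helper: recompose the 48-bit address the sequence materializes.
static uint64_t composed_address(const uint32_t insns[3]) {
  uint64_t addr = 0;
  for (int i = 0; i < 3; i++) {
    addr |= (uint64_t)((insns[i] >> 5) & 0xFFFF) << (16 * i);
  }
  return addr;
}
```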
>>>
>>> With C2 the barrier manifests as a (Set dst con) for a special ConP value (operand con has type immRegionGrainShift) 
>>> feeding a LoadB. I guess C2 might conceivably be able to optimize away some of the repeated movz/k and ldrb sequences 
>>> if it is able to keep the address or byte value in a register or spill slot, but I would not expect that to be likely.
>>>
>>>> I suppose this requires us to put multiple copies of G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a 
>>>> limit for the offset. But we will be patching fewer places than every site that needs to know the grain size.
>>> I think what you are suggesting here is what I described as option 4, i.e. we put the grain size in the nmethod const 
>>> section (or in a dedicated data location for a non-nmethod blob) and insert a pc-relative load in the barrier to feed 
>>> the lsr.
>>>
>>> With AOT code this would require a special relocation to mark the constants area slot (or the non-method blob data 
>>> slot), let's call it reloc_grain_shift_const. It would patch the constant to whatever value field 
>>> G1HeapRegion::LogHRGrainSize has in the current runtime (or rather to whatever grain size is reported by the barrier 
>>> card table). We don't have such a reloc at present. We do have an existing reloc for a runtime data address, which is 
>>> why I implemented option 2 first (to work out where I would need to tweak the compilers and barrier set assemblers 
>>> plus auxiliary classes).
>>>
>>> With option 4 I believe we will only need one occurrence of the constant. On AArch64 we would use either adr or adrp 
>>> + add to install a pc-relative address into a register and then an ldrb via that register.
>>>
>>>    adr reg, #<21bits>
>>>    ldrb reg, [reg]
>>>    ...
>>>    lsr reg2, reg, reg2
>>>
>>> or
>>>
>>>    adrp reg, #<21bits> # selects a 4KB (2^12-byte aligned) page
>>>    add  reg, reg, #<12bits>
>>>    ldrb reg, [reg]
>>>    ...
>>>    lsr reg2, reg, reg2
>>>
>>> The adr/adrp instructions do not need relocating, which is why scheme 4 would only require one relocation per nmethod 
>>> (or non-nmethod blob).
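[As a standalone illustration of the operand arithmetic behind the adrp + add pairing (not HotSpot code): adrp encodes a signed 21-bit page delta and the trailing add supplies the low 12 bits. The encoded delta stays valid so long as code and constants keep their relative placement; note that for adrp specifically this means the blob moving by a multiple of the 4KB page size, whereas adr uses a plain byte offset.]

```cpp
#include <cassert>
#include <cstdint>

// Operands an assembler would compute for an adrp + add pair reaching a
// pc-relative constant slot.
struct AdrpOperands {
  int64_t  imm21;  // signed page delta, must fit in 21 bits
  uint32_t lo12;   // offset within the target page, for the trailing add
};

static AdrpOperands adrp_operands(uint64_t pc, uint64_t target) {
  int64_t page_delta = (int64_t)(target >> 12) - (int64_t)(pc >> 12);
  // adrp's immhi:immlo field is 21 bits, giving roughly +/-4GB of page reach.
  assert(page_delta >= -(1 << 20) && page_delta < (1 << 20));
  return { page_delta, (uint32_t)(target & 0xFFF) };
}
```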
>>>
>>> Option 3 involves generating the normal barrier
>>>
>>>      lsr reg, reg, #imm
>>>
>>> The difference is that for AOT code we would mark the instruction with a new relocation, let's call it 
>>> reloc_grain_shift_immediate. Patching for this reloc would assert that the corresponding instruction is a shift and 
>>> that the current GC barrier set is using a card table. It would update the immediate operand with whatever grain size 
>>> shift was reported by the card table.
>>>
>>> Like scheme 2, this would require a reloc for every object field write in an nmethod (or non-nmethod blob).
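[A standalone sketch of what patching a reloc_grain_shift_immediate could do on AArch64 (the reloc name comes from this thread; it does not exist in HotSpot today): "lsr xd, xn, #shift" is an alias of "ubfm xd, xn, #shift, #63", so the runtime grain shift lands in the immr field, bits 21:16.]

```cpp
#include <cassert>
#include <cstdint>

// 64-bit UBFM base: sf=1, opc=10, 100110, N=1 -> 0xD3400000.
static const uint32_t UBFM64_BASE = 0xD3400000u;

// Encode `lsr xd, xn, #shift` (the UBFM alias with imms == 63).
static uint32_t lsr_imm(unsigned rd, unsigned rn, unsigned shift) {
  return UBFM64_BASE | (shift << 16) | (63u << 10) | (rn << 5) | rd;
}

// Recognize the LSR-immediate alias: fixed opcode bits plus imms == 63.
static bool is_lsr_imm(uint32_t insn) {
  return (insn & 0xFFC0FC00u) == (UBFM64_BASE | (63u << 10));
}

// What the reloc's patch step could do: assert the marked instruction is an
// immediate shift, then rewrite immr with the grain shift reported by the
// current runtime's card table.
static void patch_grain_shift(uint32_t* insn, unsigned runtime_shift) {
  assert(is_lsr_imm(*insn));
  *insn = (*insn & ~(0x3Fu << 16)) | (runtime_shift << 16);
}
```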
>>>
>>> regards,
>>>
>>>
>>> Andrew Dinn
>>> -----------
>>>
>>
> 

