premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)
Andrew Dinn
adinn at redhat.com
Wed Jul 17 10:15:04 UTC 2024
Hi Ioi,
On 16/07/2024 17:33, ioi.lam at oracle.com wrote:
>
> On 7/15/24 9:23 AM, Andrew Dinn wrote:
>> . . .
>> The second solution modifies barrier generation when the SCCache is
>> open for writing to load the shift count from a runtime field,
>> G1HeapRegion::LogHRGrainSize i.e. the same field that determines the
>> immediate count used for normal generation. In order to make this
>> visible to the compilers and SCC address table the address of this
>> field is exported via the card table. This solution requires the AOT
>> code to reference the target address using a runtime address
>> relocation. Once again, if the SCCache is not open for writing the
>> count is generated as normal i.e. as an immediate operand.
>>
>>
> Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?
>
> ldr grain, [pc, #5678]
That's not what this option does. The barrier loads the grain size
indirectly via a constant static field address, i.e. via address
&G1HeapRegion::LogHRGrainSize (well, actually, the constant is
determined by whatever address is reported by the barrier card table but
effectively it is &G1HeapRegion::LogHRGrainSize). So the barrier
uses a sequence like this
movz reg, #<16bit>
movk reg, #<16bit>, lsl #16
movk reg, #<16bit>, lsl #32
ldrb reg, [reg]
. . .
lsr reg2, reg, reg2
The 16 bit quantities compose to the address of the field. The 3 x mov
sequence is marked with a runtime relocation which ensures that it is
updated when generated code is restored from the SCCache. That requires
the field address to be inserted in the SCC address table's list of
external addresses.
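
As a rough illustration of how the three 16-bit immediates compose to the field address, here is a minimal C++ sketch. It assumes the address fits in 48 bits (which is what a 3 x mov sequence implies). The function names split_address and compose_address are made up for this example and are not HotSpot APIs.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: split a 48-bit address into the three 16-bit
// immediates carried by movz (bits 0-15), movk lsl #16 (bits 16-31)
// and movk lsl #32 (bits 32-47).
std::array<uint16_t, 3> split_address(uint64_t addr) {
    return { static_cast<uint16_t>(addr & 0xFFFF),
             static_cast<uint16_t>((addr >> 16) & 0xFFFF),
             static_cast<uint16_t>((addr >> 32) & 0xFFFF) };
}

// Hypothetical helper: model what the register holds after the sequence.
// movz writes the low 16 bits and zeroes the rest; each movk then
// overwrites one 16-bit field and keeps the other bits.
uint64_t compose_address(const std::array<uint16_t, 3>& imm) {
    uint64_t reg = imm[0];                        // movz reg, #imm0
    reg |= static_cast<uint64_t>(imm[1]) << 16;   // movk reg, #imm1, lsl #16
    reg |= static_cast<uint64_t>(imm[2]) << 32;   // movk reg, #imm2, lsl #32
    return reg;
}
```

A runtime relocation for this sequence would, in effect, recompute the three immediates from the field address in the current JVM and rewrite the three instructions.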
This scheme requires repeating that series of 3 x mov + ldrb
instructions at every object field store in a given compiled method.
That also implies a runtime relocation for each such sequence when the
code is restored from the SCCache.
With C2 the barrier manifests as a (Set dst con) for a special ConP
value (operand con has type immRegionGrainShift) feeding a LoadB. I
guess C2 might conceivably be able to optimize away some of the repeat
movz/k and ldrb sequences if it is able to keep the address or byte
value in a register or spill slot but I would not expect that to be likely.
> I suppose this require us to put multiple copies of
> G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit for
> the offset. But we will be patching fewer places than every sites that
> needs to know the grain size.
I think what you are suggesting here is what I described as option 4.
i.e. we put the grain size in the nmethod const section (or in a
dedicated data location for a non-nmethod blob) and insert a pc-relative
load in the barrier to feed the lsr.
With AOT code this would require a special relocation to mark the
constants area slot (or the non-method blob data slot), let's call it
reloc_grain_shift_const. It would patch the constant to whatever value
field G1HeapRegion::LogHRGrainSize has in the current runtime (or rather
to whatever grain size is reported by the barrier card table). We don't
have such a reloc at present. We do have an existing reloc for a
runtime data address which is why I implemented option 2 first (to work
out where I would need to tweak the compilers and barrier set assemblers
plus auxiliary classes).
With option 4 I believe we will only need one occurrence of the
constant. On AArch64 we would use either adr or adrp + add to install a
pc-relative address into a register and then an ldrb via that register.
adr reg, #<21bits>
ldrb reg, [reg]
...
lsr reg2, reg, reg2
or
adrp reg, #<21bits> # selects 12 bit-aligned page
add reg, reg, #<12bits>
ldrb reg, [reg]
...
lsr reg2, reg, reg2
The adr/adrp instructions do not need relocating which is why scheme 4
would only require 1 relocation per nmethod (or non-nmethod blob).
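
To make the choice between adr and adrp + add concrete, here is a small C++ sketch of the address arithmetic. It relies on the architectural ranges: adr takes a signed 21-bit byte offset (roughly +/- 1 MiB), while adrp takes a signed 21-bit count of 4 KiB pages and add supplies the low 12 bits. The struct and function names are hypothetical, chosen only for this example.

```cpp
#include <cstdint>

// Hypothetical result type: the two operands of an adrp + add pair.
struct AdrpAdd {
    int64_t  page_delta;  // signed page count encoded in adrp
    uint32_t low12;       // byte offset within the page, encoded in add
};

// adr can be used on its own only if the target lies within the signed
// 21-bit byte range of the instruction's own address.
bool adr_reaches(uint64_t pc, uint64_t target) {
    int64_t delta = static_cast<int64_t>(target - pc);
    return delta >= -(1LL << 20) && delta < (1LL << 20);
}

// Split a target into the page delta for adrp and the low 12 bits for add.
AdrpAdd split_for_adrp(uint64_t pc, uint64_t target) {
    int64_t page_delta = static_cast<int64_t>(target >> 12)
                       - static_cast<int64_t>(pc >> 12);
    return { page_delta, static_cast<uint32_t>(target & 0xFFF) };
}

// Model the register value after adrp + add executes at address pc.
uint64_t apply_adrp_add(uint64_t pc, const AdrpAdd& s) {
    uint64_t page = (pc & ~0xFFFULL)
                  + (static_cast<uint64_t>(s.page_delta) << 12);
    return page + s.low12;
}
```

Both forms depend only on the distance between the instruction and the constant, so as long as that distance is unchanged when the code is restored, the instructions can be left as they are.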
Option 3 involves generating the normal barrier
lsr reg, reg, #imm
The difference is that for AOT code we would mark the instruction with a
new relocation, let's call it reloc_grain_shift_immediate. Patching for
this reloc would assert that the corresponding instruction is a shift
and that the current GC barrier set is using a card table. It would
update the immediate operand with whatever grain size shift was reported
by the card table.
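
As a sketch only, the patching step for such a reloc might look like the C++ below. It assumes the barrier uses the 64-bit immediate form of lsr, which on AArch64 is an alias of ubfm with imms = 63 and the shift count in the immr field (bits 16-21). The names is_lsr_imm64 and patch_grain_shift are hypothetical, and the check that the barrier set uses a card table is left out.

```cpp
#include <cstdint>
#include <stdexcept>

// Fixed bits of a 64-bit "lsr Xd, Xn, #shift" (ubfm with imms == 63).
// The mask ignores immr (bits 16-21), Rn (bits 5-9) and Rd (bits 0-4).
constexpr uint32_t kLsrImm64Mask = 0xFFC0FC00u;
constexpr uint32_t kLsrImm64Bits = 0xD340FC00u;

// Hypothetical check corresponding to the "is a shift" assertion.
bool is_lsr_imm64(uint32_t insn) {
    return (insn & kLsrImm64Mask) == kLsrImm64Bits;
}

// Hypothetical patch routine: replace the shift count in place and
// leave the source and destination registers untouched.
uint32_t patch_grain_shift(uint32_t insn, unsigned shift) {
    if (!is_lsr_imm64(insn)) {
        throw std::invalid_argument("not a 64-bit lsr immediate");
    }
    if (shift > 63) {
        throw std::invalid_argument("shift out of range");
    }
    return (insn & ~(0x3Fu << 16)) | (static_cast<uint32_t>(shift) << 16);
}
```

For example, 0xD344FC20 encodes lsr x0, x1, #4, and patching it with a shift of 21 changes only the immr field.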
Like scheme 2 this would require a reloc for every object field write in
an nmethod (or non-nmethod blob).
regards,
Andrew Dinn
-----------