premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Tue Jul 16 16:33:56 UTC 2024

On 7/15/24 9:23 AM, Andrew Dinn wrote:
> I have prototyped two (aarch64-specific) solutions for JDK-8335440 
> both of which fix the G1 write barrier in AOT code to use the runtime 
> region grain size. Both solutions make AOT code resilient to any 
> change in max heap between assembly and production runs.
>
> The problem arises because ergonomics uses the heap size to derive a 
> G1 region size and the latter size determines what shift is needed to 
> convert a store address to a card table index. In currently generated 
> (nmethod and stub) code the shift count is installed as an immediate 
> operand of a generated shift instruction. In AOT code the shift count 
> needs to be appropriate to the current runtime region size. AOT code 
> can resolve this requirement in two ways. It can load the shift from a 
> well known location and supply the shift count as a register operand. 
> Alternatively, it can employ load-time rewriting of the instruction 
> stream to update the immediate operand.
>
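To illustrate the problem described above, here is a minimal C++ sketch (not HotSpot code) contrasting a shift count baked in as an immediate with one loaded at run time. The names log_region_grain_size, log2_of_region_size, crosses_region_immediate and crosses_region_loaded are hypothetical, and the xor-and-shift test is only a simplified stand-in for the real barrier code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime value derived from the region size.
// Assumption: region sizes are powers of two, so the shift is log2(size).
static uint8_t log_region_grain_size = 0;

// Derive the shift count from a power-of-two region size in bytes.
static uint8_t log2_of_region_size(uint64_t region_bytes) {
    assert(region_bytes != 0 && (region_bytes & (region_bytes - 1)) == 0);
    uint8_t shift = 0;
    while ((uint64_t{1} << shift) < region_bytes) {
        ++shift;
    }
    return shift;
}

// Immediate form: the shift is fixed at the time the code is generated.
template <unsigned Shift>
static bool crosses_region_immediate(uint64_t store_addr, uint64_t new_val) {
    return ((store_addr ^ new_val) >> Shift) != 0;
}

// Loaded form: the shift is read from memory each time the check runs.
static bool crosses_region_loaded(uint64_t store_addr, uint64_t new_val) {
    return ((store_addr ^ new_val) >> log_region_grain_size) != 0;
}
```

If code was generated for 1 MB regions (shift 20) but runs with 4 MB regions (shift 22), the immediate form treats the addresses 0x100000 and 0x2FFFFF as lying in different regions, while the loaded form, once log_region_grain_size has been set to 22, does not.
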
> Both current solutions rely on loading rather than instruction 
> rewriting. The first solution installs the shift count in a (byte) 
> field added to every Java thread. It modifies barrier generation when 
> the SCCache is open for writing to load the shift count from the 
> thread field. This solution requires no relocation when the AOT 
> stub/nmethod is loaded from the cache since the load is always at a 
> fixed offset from the thread register. If the SCCache is not open for 
> writing the count is generated as normal i.e. as an immediate operand.
>
>
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-thread 
>
>
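The first solution can be pictured with the following rough sketch. It is an illustration under stated assumptions, not the code in the branch above: ThreadRecord, its fields, init_thread and load_shift_via_thread are all hypothetical names, and the real thread layout is far larger.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, heavily simplified thread record.
struct ThreadRecord {
    void*   stack_base;          // placeholder for unrelated thread state
    uint8_t some_flag;           // placeholder for an existing byte field
    uint8_t region_grain_shift;  // per-thread copy of the shift count
};

// The offset is a constant of the build, so code that loads from
// [thread + kShiftOffset] needs no relocation when loaded from a cache.
constexpr size_t kShiftOffset = offsetof(ThreadRecord, region_grain_shift);

// Assumed to run at thread creation: copy the runtime shift into the thread.
static void init_thread(ThreadRecord* thread, uint8_t runtime_shift) {
    assert(runtime_shift < 64);
    thread->region_grain_shift = runtime_shift;
}

// Models the barrier's load: one byte at a fixed offset from the thread base.
static uint8_t load_shift_via_thread(const ThreadRecord* thread) {
    const uint8_t* base = reinterpret_cast<const uint8_t*>(thread);
    return base[kShiftOffset];
}
```

The cost of this design is one extra byte load per barrier and one byte per thread; the benefit is that the generated code is position- and configuration-independent with respect to the region size.
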
> The second solution modifies barrier generation when the SCCache is 
> open for writing to load the shift count from a runtime field, 
> G1HeapRegion::LogHRGrainSize i.e. the same field that determines the 
> immediate count used for normal generation. In order to make this 
> visible to the compilers and SCC address table the address of this 
> field is exported via the card table. This solution requires the AOT 
> code to reference the target address using a runtime address 
> relocation. Once again, if the SCCache is not open for writing the 
> count is generated as normal i.e. as an immediate operand.
>
>
Is the G1HeapRegion::LogHRGrainSize loaded with PC offset?

     ldr grain, [pc, #5678]

I suppose this requires us to put multiple copies of 
G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit on 
the offset. But we would be patching fewer places than every site that 
needs to know the grain size.

Thanks

- Ioi

> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-memory 
>
>
> I ran the javac benchmark with each of these solutions compared 
> against the equivalent premain build. Neither indicated any noticeable 
> change in execution time -- at least not within the margins of error 
> of the test runs (which, on my M2 machine were +/- 5 msecs in a 95 
> msec run). A better test might be to take a long running app and see 
> if the change to the AOT barrier code introduced any change in overall 
> execution time.
>
> I implemented these two solutions first because neither of them 
> requires implementing any new relocations. There are two alternatives 
> which would require new relocations that may still be worth 
> investigating. Option three is to mark the shift instruction with a 
> new relocation. Patching of the relocation address would simply 
> require regenerating it with an immediate that matches (log2 of the) 
> current region size.
>
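As a rough illustration of what option three's patching could look like, the sketch below rewrites the immediate of an AArch64 LSR instruction. It assumes the shift in question is a 64-bit "LSR Xd, Xn, #shift", which is an alias of "UBFM Xd, Xn, #shift, #63" with the shift amount in the 6-bit immr field (bits 21:16). The function names are hypothetical, and a real implementation would presumably go through the existing assembler and relocation machinery rather than raw bit twiddling.

```cpp
#include <cassert>
#include <cstdint>

// Position and mask of the immr field in the UBFM/LSR (immediate) encoding.
constexpr uint32_t kImmrShift = 16;
constexpr uint32_t kImmrMask  = 0x3Fu << kImmrShift;

// Return a copy of the instruction word with its shift amount replaced.
static uint32_t patch_lsr_shift(uint32_t insn, uint32_t runtime_shift) {
    assert(runtime_shift < 64);
    return (insn & ~kImmrMask) | (runtime_shift << kImmrShift);
}

// Read back the shift amount currently encoded in the instruction word.
static uint32_t lsr_shift_of(uint32_t insn) {
    return (insn & kImmrMask) >> kImmrShift;
}
```

For example, 0xD354FC20 encodes "lsr x0, x1, #20"; patching it with a runtime shift of 22 yields 0xD356FC20.
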
> The fourth option is to load the shift count from a data area 
> associated with the current blob. In the case of an nmethod this would 
> be the nmethod constants section. In the case of a generated stub this 
> would have to be a dedicated memory address in its associated blob. 
> Either way the data location would need to be marked with a new 
> relocation. Patching of the relocation address would simply require 
> copying the (log2 of the) current region size into the data area.
>
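Option four's load-time patching might be modelled along these lines. This is only a sketch: CachedBlob, grain_shift_slots, patch_grain_shift and load_shift_from_blob are hypothetical names, and the list of slot offsets stands in for whatever the new relocation type would record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a blob restored from the code cache.
struct CachedBlob {
    std::vector<uint8_t> bytes;             // code followed by its data area
    std::vector<size_t>  grain_shift_slots; // offsets marked by the relocation
};

// Load-time patching: copy the current shift count into every marked slot.
static void patch_grain_shift(CachedBlob& blob, uint8_t runtime_shift) {
    for (size_t offset : blob.grain_shift_slots) {
        assert(offset < blob.bytes.size());
        blob.bytes[offset] = runtime_shift;
    }
}

// Models the generated code reading the shift from its own blob's data slot.
static uint8_t load_shift_from_blob(const CachedBlob& blob, size_t slot_offset) {
    return blob.bytes.at(slot_offset);
}
```

Compared with option three, only the data slots are rewritten, so the instruction stream itself stays untouched after loading.
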
> I'll hold off on adding these solutions (also on implementing the x86 
> versions -- well, more likely, letting Ashu provide them ;-) until I 
> get some feedback on these first two. I'll also see if I can get any 
> better indication of whether the performance of the first two 
> solutions is an issue. I think solution one is by far the simplest, 
> resolving the immediate issue with least fuss (note that I poked the 
> necessary data byte into a hole in the thread record so it has no 
> space implications). However, if we end up having to tweak generated 
> code to deal with other config changes the alternatives may be worth 
> investing in as they might scale better.
>
> regards,
>
>
> Andrew Dinn
> -----------
>