premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Mon Jul 15 21:54:21 UTC 2024

On 7/15/24 9:23 AM, Andrew Dinn wrote:
> I have prototyped two (aarch64-specific) solutions for JDK-8335440 both of which fix the G1 write barrier in AOT code to 
> use the runtime region grain size. Both solutions make AOT code resilient to any change in max heap between assembly and 
> production runs.
> 
> The problem arises because ergonomics uses the heap size to derive a G1 region size and the latter size determines what 
> shift is needed to convert a store address to a card table index. In currently generated (nmethod and *stub*) code the 
> shift count is installed as an immediate operand of a generated shift instruction. In AOT code the shift count needs to 
> be appropriate to the current runtime region size. AOT code can resolve this requirement in two ways. It can load the 
> shift from a well known location and supply the shift count as a register operand. Alternatively, it can employ 
> load-time rewriting of the instruction stream to update the immediate operand.
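
To make the problem concrete, here is a small Python model of a shift count that is derived from the heap size and then baked into code. It is only a sketch: the sizing rule (about 2048 regions, clamped to 1 MB .. 32 MB), the function names, and the use of the shift for a same-region test of two addresses are simplifying assumptions for illustration, not HotSpot code.

```python
# Simplified model; names and the sizing rule are illustrative assumptions.
MB = 1024 * 1024

def region_size_for_heap(max_heap_bytes):
    """Pick a power-of-two region size from the max heap size (assumed rule:
    aim for about 2048 regions, clamped to the range 1 MB .. 32 MB)."""
    target = max(max_heap_bytes // 2048, 1)
    size = 1 << (target.bit_length() - 1)   # round down to a power of two
    return min(max(size, 1 * MB), 32 * MB)

def log_region_size(max_heap_bytes):
    """log2 of the region size, i.e. the shift count the barrier needs."""
    return region_size_for_heap(max_heap_bytes).bit_length() - 1

def crosses_region(store_addr, new_val, shift):
    """True when the two addresses fall in different regions for this shift."""
    return ((store_addr ^ new_val) >> shift) != 0

# Shift baked in during the assembly run (1 GB heap) versus the shift that
# matches the production run (16 GB heap).
assembly_shift = log_region_size(1024 * MB)
production_shift = log_region_size(16 * 1024 * MB)

store_addr = 0x10000000
new_val = store_addr + 2 * MB

stale = crosses_region(store_addr, new_val, assembly_shift)
fresh = crosses_region(store_addr, new_val, production_shift)
print(assembly_shift, production_shift, stale, fresh)
```

With these assumed numbers the stale immediate and the runtime value disagree about the same pair of addresses, which is the failure mode that a runtime-supplied shift count avoids.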
> 
> Both current solutions rely on loading rather than instruction rewriting. The first solution installs the shift count in 
> a (byte) field added to every Java thread. It modifies barrier generation when the SCCache is open for writing to load 
> the shift count from the thread field. This solution requires no relocation when the AOT stub/nmethod is loaded from the 
> cache since the load is always at a fixed offset from the thread register. If the SCCache is not open for writing the 
> count is generated as normal i.e. as an immediate operand.
> 
> 
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-thread

My main concern with the first solution is that it adds a field to the `Thread` class which will be used only at 
startup, by AOT code. It also adds a load, which may miss in the CPU cache.

On the other hand, it is a simple, straightforward change.
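
The trade-off in the first solution can be sketched as a toy emitter that chooses between the two instruction shapes. This is a hedged illustration, not the prototype's code: `emit_region_shift`, `SHIFT_FIELD_OFFSET`, the register choices (with `x28` standing in for the thread register) and the mnemonic strings are all assumptions made for the example.

```python
# Illustrative sketch only; the offset, registers and mnemonics are assumed.
SHIFT_FIELD_OFFSET = 0x3F0   # hypothetical offset of the byte field in the thread

def emit_region_shift(cache_open_for_writing, log_region_size):
    """Return the pseudo-assembly used to shift x10 by the region shift count."""
    if cache_open_for_writing:
        # AOT shape: load the count from a fixed offset off the thread
        # register, then shift by register. Nothing here needs relocation.
        return [
            f"ldrb w9, [x28, #{SHIFT_FIELD_OFFSET}]",
            "lsr x10, x10, x9",
        ]
    # Normal shape: the count is an immediate, fixed when the code is generated.
    return [f"lsr x10, x10, #{log_region_size}"]

aot_code = emit_region_shift(True, 20)
jit_code = emit_region_shift(False, 20)
print(aot_code)
print(jit_code)
```

The AOT shape costs one extra load per barrier, which is the cache-miss concern above, but its text is independent of the region size.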

> 
> The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a 
> runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal 
> generation. In order to make this visible to the compilers and SCC address table the address of this field is exported 
> via the card table. This solution requires the AOT code to reference the target address using a runtime address 
> relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an immediate 
> operand.
> 
> 
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-memory

This is a more complicated change, and it looks like a half-step towards having a separate relocation for it.

> 
> I ran the javac benchmark with each of these solutions compared against the equivalent premain build. Neither indicated 
> any noticeable change in execution time -- at least not within the margins of error of the test runs (which, on my M2 
> machine were +/- 5 msecs in a 95 msec run). A better test might be to take a long running app and see if the change to 
> the AOT barrier code introduced any change in overall execution time.
> 
> I implemented these two solutions first because neither of them requires implementing any new relocations. There are two 
> alternatives which would require new relocations that may still be worth investigating. Option three is to mark the 
> shift instruction with a new relocation. Patching of the relocation address would simply require regenerating it with an 
> immediate that matches (log2 of the) current region size.

Please try this approach (new relocation).
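
As a rough sketch of what patching such a relocation could involve on aarch64: `lsr Xd, Xn, #shift` is an alias of `ubfm Xd, Xn, #shift, #63`, so the shift count sits in the 6-bit `immr` field (bits 21:16) of the instruction word. The helper names and the list of relocated word indices below are hypothetical; a real relocation type would live in the HotSpot relocation machinery rather than look like this.

```python
# Sketch of rewriting the immediate of an AArch64 64-bit LSR instruction.
IMMR_SHIFT = 16
IMMR_MASK = 0x3F << IMMR_SHIFT
LSR_X_BASE = 0xD340FC00   # ubfm Xd, Xn, #0, #63 with Rd = Rn = 0

def encode_lsr_imm(rd, rn, shift):
    """Encode `lsr Xd, Xn, #shift` as a 32-bit instruction word."""
    assert 0 <= rd < 32 and 0 <= rn < 32 and 0 <= shift < 64
    return LSR_X_BASE | (shift << IMMR_SHIFT) | (rn << 5) | rd

def patch_lsr_shift(insn, new_shift):
    """Return the instruction word with only its shift count replaced."""
    assert 0 <= new_shift < 64
    return (insn & ~IMMR_MASK) | (new_shift << IMMR_SHIFT)

def apply_shift_relocs(code_words, reloc_indices, runtime_log_region_size):
    """Patch every marked instruction word (hypothetical relocation list)."""
    patched = list(code_words)
    for i in reloc_indices:
        patched[i] = patch_lsr_shift(patched[i], runtime_log_region_size)
    return patched

nop = 0xD503201F
cached = [nop, encode_lsr_imm(10, 10, 20), nop]   # as written at assembly time
loaded = apply_shift_relocs(cached, [1], 23)      # as fixed up at load time
print([hex(w) for w in loaded])
```

After patching, the code is the same as if it had been generated with the runtime region size, so there is no extra load in the barrier.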

Thanks,
Vladimir K

> 
> The fourth option is to load the shift count from a data area associated with the current blob. In the case of an 
> nmethod this would be the nmethod constants section. In the case of a generated stub this would have to be a dedicated 
> memory address in its associated blob. Either way the data location would need to be marked with a new relocation. 
> Patching of the relocation address would simply require copying the (log2 of the) current region size into the data area.
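
For comparison with instruction patching, the fourth option can be modelled as a one-byte slot in the blob's data area that the loader overwrites. This is a minimal sketch under assumed names: the blob layout, `load_blob`, and the list of slot offsets stand in for whatever a real relocation would record.

```python
# Minimal model of a per-blob data slot patched at load time (names assumed).
def load_blob(cached_blob, shift_slot_offsets, runtime_log_region_size):
    """Copy the cached blob and store the runtime shift count in each slot."""
    blob = bytearray(cached_blob)
    for offset in shift_slot_offsets:
        blob[offset] = runtime_log_region_size
    return blob

def read_shift(blob, slot_offset):
    """What the barrier's load from the blob-local data area would see."""
    return blob[slot_offset]

cached_blob = bytearray(16)
cached_blob[8] = 20                      # value from the assembly run
runtime_blob = load_blob(cached_blob, [8], 23)
print(read_shift(runtime_blob, 8))
```

The instruction stream stays untouched here; only data changes, at the cost of keeping a load in the barrier.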
> 
> I'll hold off on adding these solutions (also on implementing the x86 versions -- well, more likely, letting Ashu 
> provide them ;-) until I get some feedback on these first two. I'll also see if I can get any better indication of 
> whether the performance of the first two solutions is an issue. I think solution one is by far the simplest, resolving 
> the immediate issue with least fuss (note that I poked the necessary data byte into a hole in the thread record so it 
> has no space implications). However, if we end up having to tweak generated code to deal with other config changes the 
> alternatives may be worth investing in as they might scale better.
> 
> regards,
> 
> 
> Andrew Dinn
> -----------
>