premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)

Mon Jul 15 16:23:36 UTC 2024

I have prototyped two (aarch64-specific) solutions for JDK-8335440, both 
of which fix the G1 write barrier in AOT code to use the runtime region 
grain size. Both solutions make AOT code resilient to any change in max 
heap between assembly and production runs.

The problem arises because ergonomics uses the heap size to derive a G1 
region size and the latter size determines what shift is needed to 
convert a store address to a card table index. In currently generated 
code (nmethod and *stub* code) the shift count is installed as an 
immediate operand of a generated shift instruction. In AOT code the 
shift count needs to be appropriate to the current runtime region size, 
which may differ from the one seen in the assembly run. AOT code can 
resolve this requirement in two ways. It can load the shift from a well 
known location and supply the shift count as a register operand. 
Alternatively, it can employ load-time rewriting of the instruction 
stream to update the immediate operand.
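
As a rough illustration of why the baked-in immediate goes stale, here 
is a minimal C++ sketch that derives a shift count from the max heap 
size. The sizing rule (aim for about 2048 regions, clamp to 1M..32M, 
round down to a power of two) is a simplified assumption about what the 
ergonomics does, and all of the names are invented for this example; it 
is not the HotSpot code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of G1 region sizing. The constants are
// assumptions for illustration, not a copy of the HotSpot ergonomics.
constexpr uint64_t M = 1024 * 1024;
constexpr uint64_t kTargetRegions = 2048;
constexpr uint64_t kMinRegionBytes = 1 * M;
constexpr uint64_t kMaxRegionBytes = 32 * M;

// log2 of the region size derived from the max heap size.
inline unsigned region_shift_for_heap(uint64_t max_heap_bytes) {
  assert(max_heap_bytes > 0);
  uint64_t size = max_heap_bytes / kTargetRegions;
  if (size < kMinRegionBytes) size = kMinRegionBytes;
  if (size > kMaxRegionBytes) size = kMaxRegionBytes;
  unsigned shift = 0;
  while ((size >> (shift + 1)) != 0) shift++;  // floor(log2(size))
  return shift;
}

// True when an immediate baked in during the assembly run no longer
// matches the region size chosen for the production run.
inline bool baked_shift_is_stale(uint64_t assembly_heap_bytes,
                                 uint64_t production_heap_bytes) {
  return region_shift_for_heap(assembly_heap_bytes) !=
         region_shift_for_heap(production_heap_bytes);
}
```

Under these assumptions a 2G heap gives a shift of 20 and an 8G heap a 
shift of 22, so code assembled with one and run with the other would 
shift by the wrong amount.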

Both current solutions rely on loading rather than instruction 
rewriting. The first solution installs the shift count in a (byte) field 
added to every Java thread. It modifies barrier generation when the 
SCCache is open for writing to load the shift count from the thread 
field. This solution requires no relocation when the AOT stub/nmethod is 
loaded from the cache since the load is always at a fixed offset from 
the thread register. If the SCCache is not open for writing the count is 
generated as normal i.e. as an immediate operand.

https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-thread
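
The difference between the two code shapes in solution one can be 
modelled in a few lines of C++. This is only a sketch: ThreadModel and 
its field name are hypothetical stand-ins for the Java thread record, 
and the real change is made in the aarch64 barrier generator, not in 
C++ like this.

```cpp
#include <cstdint>

// Hypothetical stand-in for the Java thread record; the field name is
// invented for this sketch.
struct ThreadModel {
  uint8_t g1_region_shift;  // set at startup from the runtime region size
};

// Normal generation: the shift is fixed when the code is generated,
// modelled here as a compile-time constant.
template <unsigned Shift>
inline uint64_t shifted_immediate(uint64_t addr) {
  return addr >> Shift;
}

// AOT generation under solution one: a byte load at a fixed offset from
// the thread pointer, followed by a shift by register. Nothing in this
// code depends on an absolute address, so nothing needs relocating.
inline uint64_t shifted_via_thread(const ThreadModel* thread, uint64_t addr) {
  return addr >> thread->g1_region_shift;
}
```

The cost is the extra load on every barrier execution; the benefit is 
that the same bytes are correct for any region size.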

The second solution modifies barrier generation when the SCCache is open 
for writing to load the shift count from a runtime field, 
G1HeapRegion::LogHRGrainSize i.e. the same field that determines the 
immediate count used for normal generation. In order to make this 
visible to the compilers and SCC address table, the address of this 
field is exported via the card table. This solution requires the AOT 
code to 
reference the target address using a runtime address relocation. Once 
again, if the SCCache is not open for writing the count is generated as 
normal i.e. as an immediate operand.

https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-memory
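
The role of the runtime address relocation in solution two can be 
sketched as follows. Everything here is a simplified model with 
hypothetical names (the id enum, CachedCode, the table builder): cached 
code records an id for the field rather than the assembly run's raw 
address, and the id is resolved against the current run's address table 
when the code is loaded.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the runtime field holding log2 of the region size.
static uint8_t runtime_log_region_size = 0;

// Hypothetical address table ids: cached code refers to external
// addresses by a stable id, not by the address seen at assembly time.
enum ExternalId : std::size_t { kLogRegionSizeId = 0, kExternalIdCount };

struct CachedCode {
  ExternalId shift_ref;       // what is stored in the cache
  const uint8_t* shift_addr;  // filled in when the code is loaded
};

inline std::vector<const uint8_t*> build_address_table() {
  std::vector<const uint8_t*> table(kExternalIdCount);
  table[kLogRegionSizeId] = &runtime_log_region_size;
  return table;
}

// Load-time relocation: resolve the id against this run's table.
inline void relocate(CachedCode& code,
                     const std::vector<const uint8_t*>& table) {
  code.shift_addr = table[code.shift_ref];
}

// The barrier then loads the shift through the resolved address.
inline uint64_t shifted_via_memory(const CachedCode& code, uint64_t addr) {
  return addr >> *code.shift_addr;
}
```

Compared with the thread field, this keeps a single copy of the value 
but needs the relocation step before the code can run.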

I ran the javac benchmark with each of these solutions compared against 
the equivalent premain build. Neither indicated any noticeable change in 
execution time -- at least not within the margins of error of the test 
runs (which, on my M2 machine, were +/- 5 msecs in a 95 msec run, i.e. 
roughly 5%). A better test might be to take a long running app and see 
if the change to the AOT barrier code introduces any change in overall 
execution time.

I implemented these two solutions first because neither of them requires 
implementing any new relocations. There are two alternatives which would 
require new relocations that may still be worth investigating. Option 
three is to mark the shift instruction with a new relocation. Patching 
the relocation would simply require regenerating the shift instruction 
with an immediate that matches (log2 of the) current region size.
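
For a sense of what such a patch involves on aarch64, here is a small 
C++ sketch. It relies on LSR (immediate) on a 64-bit register being an 
alias of UBFM Xd, Xn, #shift, #63, with the shift amount held in the 
immr field (bits 21:16). The function names are hypothetical and this 
is not how the HotSpot relocation code is structured; it only shows 
that the patch is a field rewrite within one 32-bit instruction word.

```cpp
#include <cassert>
#include <cstdint>

// "LSR Xd, Xn, #shift" == "UBFM Xd, Xn, #shift, #63"; the shift amount
// lives in immr, bits 21:16 of the instruction word.
constexpr uint32_t kImmrShift = 16;
constexpr uint32_t kImmrMask = 0x3Fu << kImmrShift;

inline uint32_t encode_lsr_imm64(unsigned rd, unsigned rn, unsigned shift) {
  assert(rd < 32 && rn < 32 && shift < 64);
  return 0xD340FC00u | (shift << kImmrShift) | (rn << 5) | rd;
}

inline unsigned lsr_shift_of(uint32_t insn) {
  return (insn & kImmrMask) >> kImmrShift;
}

// What patching a (hypothetical) region-shift relocation might do:
// keep the registers and opcode, replace only the immediate.
inline uint32_t patch_lsr_shift(uint32_t insn, unsigned new_shift) {
  assert(new_shift < 64);
  return (insn & ~kImmrMask) | (new_shift << kImmrShift);
}
```

For example, patching lsr x0, x0, #20 for a runtime shift of 22 changes 
the word from 0xD354FC00 to 0xD356FC00.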

The fourth option is to load the shift count from a data area associated 
with the current blob. In the case of an nmethod this would be the 
nmethod constants section. In the case of a generated stub this would 
have to be a dedicated memory address in its associated blob. Either way 
the data location would need to be marked with a new relocation. 
Patching of the relocation address would simply require copying the 
(log2 of the) current region size into the data area.
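
A minimal C++ model of this fourth option, under the assumption that a 
new relocation type simply records which data offsets hold the shift. 
BlobModel and the function names are hypothetical; a real nmethod 
constants section or stub blob has a quite different layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical blob: a data area plus the offsets that a new relocation
// type would mark as "holds the region shift".
struct BlobModel {
  std::vector<uint8_t> data;
  std::vector<std::size_t> shift_slots;
};

// Load-time patching: copy the runtime shift into every marked slot.
inline void patch_shift_slots(BlobModel& blob, uint8_t runtime_shift) {
  for (std::size_t off : blob.shift_slots) {
    blob.data.at(off) = runtime_shift;
  }
}

// The barrier reads its shift from its own blob's data area.
inline uint64_t shifted_via_blob(const BlobModel& blob, std::size_t slot,
                                 uint64_t addr) {
  return addr >> blob.data.at(slot);
}
```

As with the thread and memory variants this still costs a load per 
barrier, but the value sits next to the code that uses it.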

I'll hold off on adding these solutions (also on implementing the x86 
versions -- well, more likely, letting Ashu provide them ;-) until I get 
some feedback on these first two. I'll also see if I can get any better 
indication of whether the performance of the first two solutions is an 
issue. I think solution one is by far the simplest, resolving the 
immediate issue with least fuss (note that I poked the necessary data 
byte into a hole in the thread record so it has no space implications). 
However, if we end up having to tweak generated code to deal with other 
config changes the alternatives may be worth investing in as they might 
scale better.

regards,

Andrew Dinn
-----------