premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)
Andrew Dinn
adinn at redhat.com
Mon Jul 15 16:23:36 UTC 2024
I have prototyped two (aarch64-specific) solutions for JDK-8335440, both
of which fix the G1 write barrier in AOT code to use the runtime region
grain size. Both solutions make AOT code resilient to any change in max
heap between assembly and production runs.
The problem arises because ergonomics uses the heap size to derive a G1
region size and the latter size determines what shift is needed to
convert a store address to a card table index. In currently generated
code (nmethod and *stub* code) the shift count is installed as an immediate
operand of a generated shift instruction. In AOT code the shift count
needs to be appropriate to the current runtime region size. AOT code can
resolve this requirement in two ways. It can load the shift from a well
known location and supply the shift count as a register operand.
Alternatively, it can employ load-time rewriting of the instruction
stream to update the immediate operand.
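
To make the mismatch concrete, here is a minimal C++ sketch of the problem. It
is not HotSpot code: the function names, the target of 2048 regions and the
1 MB to 32 MB clamp are assumptions chosen for illustration only. It shows how
a shift derived from one max heap size gives a different answer from a shift
derived from another when applied to the same addresses.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical ergonomics: choose a power-of-two region size so the heap
// holds roughly 2048 regions, clamped to [2^20, 2^25] bytes (assumed bounds).
// Returns log2 of the chosen region size, i.e. the shift count.
inline unsigned log_region_size_for(std::uint64_t max_heap_bytes) {
  const unsigned min_log = 20, max_log = 25;
  const std::uint64_t target = max_heap_bytes / 2048;
  unsigned log = min_log;
  while (log < max_log && (std::uint64_t{1} << (log + 1)) <= target) {
    ++log;
  }
  return log;
}

// The shift applied by the barrier. With an immediate operand the second
// argument is frozen when the code is generated; AOT code needs it to be
// the value that is current in the production run.
inline std::uint64_t shifted(std::uint64_t addr, unsigned log_region_size) {
  assert(log_region_size < 64);
  return addr >> log_region_size;
}
```

With a 2 GB heap in the assembly run and an 8 GB heap in the production run
the sketch yields shifts of 20 and 22, so two addresses 3 MB apart land in
different slots under the first shift and in the same slot under the second.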
Both current solutions rely on loading rather than instruction
rewriting. The first solution installs the shift count in a (byte) field
added to every Java thread. It modifies barrier generation when the
SCCache is open for writing to load the shift count from the thread
field. This solution requires no relocation when the AOT stub/nmethod is
loaded from the cache since the load is always at a fixed offset from
the thread register. If the SCCache is not open for writing the count is
generated as normal i.e. as an immediate operand.
https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-thread
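
The idea behind the thread-field approach can be modelled in a few lines of
C++. This is only a sketch under stated assumptions, not the code in the
branch above: ThreadRecord, region_shift, init_thread and aot_barrier_shift
are invented names standing in for the Java thread, its new byte field and the
generated barrier.

```cpp
#include <cstdint>

// Hypothetical stand-in for a Java thread record.
struct ThreadRecord {
  void*        other_state  = nullptr;
  std::uint8_t region_shift = 0;  // byte-sized copy of log2(region size)
};

// Executed in the production run when the thread is set up: store the shift
// that matches the region size chosen for this run.
inline void init_thread(ThreadRecord& t, unsigned runtime_log_region_size) {
  t.region_shift = static_cast<std::uint8_t>(runtime_log_region_size);
}

// Models the AOT barrier: one byte load at a fixed offset from the thread
// pointer, then a shift by register instead of a shift by immediate.
inline std::uint64_t aot_barrier_shift(const ThreadRecord* thread,
                                       std::uint64_t store_addr) {
  return store_addr >> thread->region_shift;
}
```

Because the only input besides the address is the thread pointer, nothing in
the modelled barrier depends on where the code is loaded, which is why no
relocation is needed in this variant.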
The second solution modifies barrier generation when the SCCache is open
for writing to load the shift count from a runtime field,
G1HeapRegion::LogHRGrainSize i.e. the same field that determines the
immediate count used for normal generation. In order to make this
visible to the compilers and SCC address table the address of this field
is exported via the card table. This solution requires the AOT code to
reference the target address using a runtime address relocation. Once
again, if the SCCache is not open for writing the count is generated as
normal i.e. as an immediate operand.
https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-memory
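
The load-via-memory approach can be sketched in the same style. Again this is
an illustration under assumptions, not the code in the branch above:
g_log_region_size stands in for the runtime field, region_shift_address for
the accessor that exports its address, and AotBarrier::shift_addr for the
address that a runtime address relocation would fill in at load time.

```cpp
#include <cstdint>

// Stand-in for the runtime field holding log2 of the region size
// (C++17 inline variable so the sketch stays header-friendly).
inline unsigned g_log_region_size = 0;

// Hypothetical accessor through which the compilers and the address table
// would learn the address of the field.
inline const unsigned* region_shift_address() {
  return &g_log_region_size;
}

// Models AOT code after loading: the relocation has resolved shift_addr to
// the field's address in the current process.
struct AotBarrier {
  const unsigned* shift_addr;

  std::uint64_t shift(std::uint64_t store_addr) const {
    return store_addr >> *shift_addr;  // load the count, then shift by register
  }
};
```

The barrier picks up whatever value the field holds in the production run, at
the cost of one relocated address per barrier site.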
I ran the javac benchmark with each of these solutions compared against
the equivalent premain build. Neither indicated any noticeable change in
execution time -- at least not within the margins of error of the test
runs (which, on my M2 machine, were +/- 5 msecs in a 95 msec run). A
better test might be to take a long running app and see if the change to
the AOT barrier code introduced any change in overall execution time.
I implemented these two solutions first because neither of them requires
implementing any new relocations. There are two alternatives which would
require new relocations that may still be worth investigating. Option
three is to mark the shift instruction with a new relocation. Patching
of the relocation address would simply require regenerating it with an
immediate that matches (log2 of the) current region size.
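
As a rough illustration of what such patching could look like on aarch64, the
sketch below rewrites the shift amount of a 64-bit LSR (immediate)
instruction. It relies on the architectural encoding, in which
LSR Xd, Xn, #shift is an alias of UBFM Xd, Xn, #shift, #63 and the shift sits
in the immr field (bits 21:16). The helper names are hypothetical, and a real
relocation would go through the existing relocation machinery and invalidate
the instruction cache after patching.

```cpp
#include <cassert>
#include <cstdint>

// immr field of UBFM, bits 21:16; for the LSR alias it holds the shift.
constexpr std::uint32_t kImmrMask = 0x3Fu << 16;

// True if insn is a 64-bit UBFM with imms == 63, i.e. LSR Xd, Xn, #shift.
inline bool is_lsr64_imm(std::uint32_t insn) {
  return (insn & 0xFFC0FC00u) == 0xD340FC00u;
}

// Returns a copy of insn with its shift replaced; registers are untouched.
inline std::uint32_t patch_lsr64_shift(std::uint32_t insn, unsigned shift) {
  assert(is_lsr64_imm(insn) && shift < 64);
  return (insn & ~kImmrMask) | (static_cast<std::uint32_t>(shift) << 16);
}
```

For example, 0xD354FC20 encodes lsr x0, x1, #20; patching it for a region
size of 4 MB gives 0xD356FC20, i.e. lsr x0, x1, #22.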
The fourth option is to load the shift count from a data area associated
with the current blob. In the case of an nmethod this would be the
nmethod constants section. In the case of a generated stub this would
have to be a dedicated memory address in its associated blob. Either way
the data location would need to be marked with a new relocation.
Patching of the relocation address would simply require copying the
(log2 of the) current region size into the data area.
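
The data-area variant can be modelled as follows. This is a sketch under
assumptions: BlobSketch, shift_relocs and the two helpers are invented names,
with the vector of offsets standing in for sites marked by the proposed new
relocation type.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical model of a code blob: a data area (the constants section for
// an nmethod, a dedicated slot for a stub) plus the offsets within it that
// carry the new relocation.
struct BlobSketch {
  std::vector<std::uint8_t> data;
  std::vector<std::size_t>  shift_relocs;
};

// Load-time patching: copy the current log2 region size into every marked slot.
inline void patch_region_shift(BlobSketch& blob, std::uint32_t log_region_size) {
  for (std::size_t off : blob.shift_relocs) {
    std::memcpy(blob.data.data() + off, &log_region_size, sizeof log_region_size);
  }
}

// Models the barrier: load the count from the blob's own data, then shift.
inline std::uint64_t blob_barrier_shift(const BlobSketch& blob, std::size_t off,
                                        std::uint64_t store_addr) {
  std::uint32_t shift = 0;
  std::memcpy(&shift, blob.data.data() + off, sizeof shift);
  return store_addr >> shift;
}
```

Compared with patching the instruction itself, the code stream stays
untouched and only the data slot changes when the blob is loaded.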
I'll hold off on adding these solutions (also on implementing the x86
versions -- well, more likely, letting Ashu provide them ;-) until I get
some feedback on these first two. I'll also see if I can get any better
indication of whether the performance of the first two solutions is an
issue. I think solution one is by far the simplest, resolving the
immediate issue with least fuss (note that I poked the necessary data
byte into a hole in the thread record so it has no space implications).
However, if we end up having to tweak generated code to deal with other
config changes the alternatives may be worth investing in as they might
scale better.
regards,
Andrew Dinn