premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Jul 15 21:54:21 UTC 2024
On 7/15/24 9:23 AM, Andrew Dinn wrote:
> I have prototyped two (aarch64-specific) solutions for JDK-8335440 both of which fix the G1 write barrier in AOT code to
> use the runtime region grain size. Both solutions make AOT code resilient to any change in max heap between assembly and
> production runs.
>
> The problem arises because ergonomics uses the heap size to derive a G1 region size and the latter size determines what
> shift is needed to convert a store address to a card table index. In currently generated code (nmethod and *stub* code) the
> shift count is installed as an immediate operand of a generated shift instruction. In AOT code the shift count needs to
> be appropriate to the current runtime region size. AOT code can resolve this requirement in two ways. It can load the
> shift from a well known location and supply the shift count as a register operand. Alternatively, it can employ
> load-time rewriting of the instruction stream to update the immediate operand.
>
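A minimal sketch of why a baked-in shift count goes stale, assuming the count is log2 of the region size in bytes and is used to decide whether two addresses fall in the same region. The helper names are hypothetical and are not HotSpot code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: log2 of a power-of-two region size in bytes.
static unsigned log2_exact(std::size_t region_bytes) {
  assert(region_bytes != 0 && (region_bytes & (region_bytes - 1)) == 0);
  unsigned shift = 0;
  while ((region_bytes >> shift) != 1) {
    ++shift;
  }
  return shift;
}

// Two addresses lie in different regions when their xor has any bit set
// at or above the region grain shift.
static bool crosses_region(std::uintptr_t store_addr,
                           std::uintptr_t new_val,
                           unsigned log_grain) {
  return ((store_addr ^ new_val) >> log_grain) != 0;
}
```

With 1 MB regions the addresses 0x100000 and 0x2FFFFF are in different regions, but with 4 MB regions they share one. Code that hard-wired the shift for one region size would therefore give the wrong answer under the other.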
> Both current solutions rely on loading rather than instruction rewriting. The first solution installs the shift count in
> a (byte) field added to every Java thread. It modifies barrier generation when the SCCache is open for writing to load
> the shift count from the thread field. This solution requires no relocation when the AOT stub/nmethod is loaded from the
> cache since the load is always at a fixed offset from the thread register. If the SCCache is not open for writing the
> count is generated as normal i.e. as an immediate operand.
>
>
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-thread
My main concern with the first solution is that we add a field to the `Thread` class which will be used only at the start by AOT
code. It is also a load, which may miss in the CPU's cache.
On the other hand, it is a straightforward, simple change.
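The thread-field approach can be sketched roughly as below. `ThreadRecord` and its fields are hypothetical stand-ins (the real `Thread` layout is different); the point is only that the load uses a fixed offset from the thread pointer, so nothing needs relocating, and that a byte placed in existing padding costs no extra space.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical thread record, not the real HotSpot Thread layout.
struct ThreadRecord {
  void*         stack_base;          // stand-in for existing fields
  std::uint8_t  existing_flag;       // an existing one-byte field
  std::uint8_t  region_grain_shift;  // new byte, placed in former padding
  std::uint64_t next_field;          // alignment of this field creates the hole
};

// Fixed offset known when the AOT code is generated.
constexpr std::size_t kGrainShiftOffset =
    offsetof(ThreadRecord, region_grain_shift);

// Models the barrier's load: one byte at [thread + fixed offset].
inline unsigned load_grain_shift(const ThreadRecord* thread) {
  const auto* base = reinterpret_cast<const std::uint8_t*>(thread);
  return base[kGrainShiftOffset];
}
```

The runtime would write the field once per thread at startup; the barrier then reads it as a register operand for the shift instead of using an immediate.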
>
> The second solution modifies barrier generation when the SCCache is open for writing to load the shift count from a
> runtime field, G1HeapRegion::LogHRGrainSize i.e. the same field that determines the immediate count used for normal
> generation. In order to make this visible to the compilers and SCC address table the address of this field is exported
> via the card table. This solution requires the AOT code to reference the target address using a runtime address
> relocation. Once again, if the SCCache is not open for writing the count is generated as normal i.e. as an immediate
> operand.
>
>
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-memory
This is a more complicated change which looks like a half-step towards having a separate relocation for it.
>
> I ran the javac benchmark with each of these solutions compared against the equivalent premain build. Neither indicated
> any noticeable change in execution time -- at least not within the margins of error of the test runs (which, on my M2
> machine were +/- 5 msecs in a 95 msec run). A better test might be to take a long running app and see if the change to
> the AOT barrier code introduced any change in overall execution time.
>
> I implemented these two solutions first because neither of them requires implementing any new relocations. There are two
> alternatives which would require new relocations that may still be worth investigating. Option three is to mark the
> shift instruction with a new relocation. Patching of the relocation address would simply require regenerating it with an
> immediate that matches (log2 of the) current region size.
Please, try this approach (new relocation).
Thanks,
Vladimir K
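As an illustration of what patching under such a relocation might amount to, here is a minimal sketch. It assumes the barrier's shift is a 64-bit AArch64 `lsr Xd, Xn, #shift`, which is an alias of `ubfm Xd, Xn, #shift, #63` with the shift count held in the `immr` field (bits 21:16). The function names are hypothetical; a real implementation would more likely regenerate the instruction through the assembler than mask bits by hand.

```cpp
#include <cassert>
#include <cstdint>

// Position of the immr field in a 64-bit UBFM/LSR (immediate) encoding.
constexpr std::uint32_t kImmrShift = 16;
constexpr std::uint32_t kImmrMask  = 0x3Fu << kImmrShift;

// Read the shift count currently encoded in the instruction word.
inline unsigned read_lsr_shift(std::uint32_t insn) {
  return (insn & kImmrMask) >> kImmrShift;
}

// Return the instruction word with its shift count replaced by new_shift,
// leaving the opcode and register fields untouched.
inline std::uint32_t patch_lsr_shift(std::uint32_t insn, unsigned new_shift) {
  assert(new_shift < 64);
  return (insn & ~kImmrMask) |
         (static_cast<std::uint32_t>(new_shift) << kImmrShift);
}
```

At load time the relocation handler would call something like `patch_lsr_shift` with log2 of the current region size and write the result back over the marked instruction.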
>
> The fourth option is to load the shift count from a data area associated with the current blob. In the case of an
> nmethod this would be the nmethod constants section. In the case of a generated stub this would have to be a dedicated
> memory address in its associated blob. Either way the data location would need to be marked with a new relocation.
> Patching of the relocation address would simply require copying the (log2 of the) current region size into the data area.
>
> I'll hold off on adding these solutions (also on implementing the x86 versions -- well, more likely, letting Ashu
> provide them ;-) until I get some feedback on these first two. I'll also see if I can get any better indication of
> whether the performance of the first two solutions is an issue. I think solution one is by far the simplest, resolving
> the immediate issue with least fuss (note that I poked the necessary data byte into a hole in the thread record so it
> has no space implications). However, if we end up having to tweak generated code to deal with other config changes the
> alternatives may be worth investing in as they might scale better.
>
> regards,
>
>
> Andrew Dinn
> -----------
>