premain: Possible solutions to use runtime G1 region grain size in AOT code (JDK-8335440)
ioi.lam at oracle.com
Tue Jul 16 16:33:56 UTC 2024
On 7/15/24 9:23 AM, Andrew Dinn wrote:
> I have prototyped two (aarch64-specific) solutions for JDK-8335440
> both of which fix the G1 write barrier in AOT code to use the runtime
> region grain size. Both solutions make AOT code resilient to any
> change in max heap between assembly and production runs.
>
> The problem arises because ergonomics uses the heap size to derive a
> G1 region size and the latter size determines what shift is needed to
> convert a store address to a card table index. In currently generated
> (nmethod and *stub*) code the shift count is installed as an immediate
> operand of a generated shift instruction. In AOT code the shift count
> needs to be appropriate to the current runtime region size. AOT code
> can resolve this requirement in two ways. It can load the shift from a
> well known location and supply the shift count as a register operand.
> Alternatively, it can employ load-time rewriting of the instruction
> stream to update the immediate operand.
>
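A rough sketch of why a baked-in shift goes stale, in case it helps the discussion. Everything here is illustrative: the 2048-region target and the 1 MiB to 32 MiB clamp are assumed constants, the function names are made up, and the barrier is modelled as a same-region test on two addresses, which is an assumption about its shape rather than a copy of the generated code.

```cpp
#include <cassert>
#include <cstdint>

// Assumed ergonomics constants, for illustration only.
constexpr unsigned      kMinLogRegion  = 20;    // 1 MiB
constexpr unsigned      kMaxLogRegion  = 25;    // 32 MiB
constexpr std::uint64_t kTargetRegions = 2048;

// Hypothetical: derive log2(region size) from the max heap size.
constexpr unsigned log2_region_size(std::uint64_t max_heap_bytes) {
    const std::uint64_t want = max_heap_bytes / kTargetRegions;
    unsigned log = kMinLogRegion;
    // Grow the region size (as a power of two) while it still fits 'want'.
    while (log < kMaxLogRegion && (std::uint64_t{1} << (log + 1)) <= want) {
        ++log;
    }
    return log;
}

// Hypothetical barrier filter: do the two addresses fall in different regions?
constexpr bool crosses_region(std::uint64_t store_addr,
                              std::uint64_t new_val,
                              unsigned shift) {
    return ((store_addr ^ new_val) >> shift) != 0;
}
```

If the assembly run used a 64 GiB heap and the production run a 4 GiB heap, the shift baked into the AOT code would be larger than the one the runtime expects, so a store between two addresses 2 MiB apart would be filtered out as same-region when it is not. Loading the shift at runtime, or patching it at load time, avoids that.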
> Both current solutions rely on loading rather than instruction
> rewriting. The first solution installs the shift count in a (byte)
> field added to every Java thread. It modifies barrier generation when
> the SCCache is open for writing to load the shift count from the
> thread field. This solution requires no relocation when the AOT
> stub/nmethod is loaded from the cache since the load is always at a
> fixed offset from the thread register. If the SCCache is not open for
> writing the count is generated as normal i.e. as an immediate operand.
>
>
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-thread
>
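For comparison, a minimal C++ model of the load-via-thread idea. The struct layout, field name, and helper names are hypothetical and not taken from the branch; the only point is that the barrier reads the shift at a fixed offset from the thread pointer, so nothing in the code needs relocating.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the Java thread record.
struct JavaThreadSketch {
    void*        other_state[4];  // placeholder for unrelated fields
    std::uint8_t grain_shift;     // hypothetical byte holding log2(region size)
};

// The offset is a compile-time constant, like an immediate in a load
// of the form [thread_reg, #offset].
constexpr std::size_t kGrainShiftOffset =
    offsetof(JavaThreadSketch, grain_shift);

// Done once per thread at startup, using the runtime region size.
inline void init_grain_shift(JavaThreadSketch& t, unsigned log2_region_size) {
    assert(log2_region_size < 64);
    t.grain_shift = static_cast<std::uint8_t>(log2_region_size);
}

// Models the AOT barrier: load the shift via the thread, then filter.
inline bool aot_crosses_region(const JavaThreadSketch* thread,
                               std::uint64_t store_addr,
                               std::uint64_t new_val) {
    const auto* base = reinterpret_cast<const std::uint8_t*>(thread);
    const unsigned shift = base[kGrainShiftOffset];
    return ((store_addr ^ new_val) >> shift) != 0;
}
```

The cost relative to the immediate form is one extra byte load per barrier, which should normally hit in cache since the thread record is hot.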
>
> The second solution modifies barrier generation when the SCCache is
> open for writing to load the shift count from a runtime field,
> G1HeapRegion::LogHRGrainSize i.e. the same field that determines the
> immediate count used for normal generation. In order to make this
> visible to the compilers and SCC address table the address of this
> field is exported via the card table. This solution requires the AOT
> code to reference the target address using a runtime address
> relocation. Once again, if the SCCache is not open for writing the
> count is generated as normal i.e. as an immediate operand.
>
>
Is G1HeapRegion::LogHRGrainSize loaded with a PC-relative offset?
ldr grain, [pc, #5678]
I suppose this requires us to put multiple copies of
G1HeapRegion::LogHRGrainSize inside the AOT code, as there's a limit on
the offset. But we will be patching fewer places than every site that
needs to know the grain size.
Thanks
- Ioi
> https://github.com/adinn/leyden/compare/premain...adinn:leyden:JDK-8335440-load-via-memory
>
>
> I ran the javac benchmark with each of these solutions compared
> against the equivalent premain build. Neither indicated any noticeable
> change in execution time -- at least not within the margins of error
> of the test runs (which, on my M2 machine, were +/- 5 msecs in a 95
> msec run). A better test might be to take a long running app and see
> if the change to the AOT barrier code introduced any change in overall
> execution time.
>
> I implemented these two solutions first because neither of them
> requires implementing any new relocations. There are two alternatives
> which would require new relocations that may still be worth
> investigating. Option three is to mark the shift instruction with a
> new relocation. Patching of the relocation address would simply
> require regenerating it with an immediate that matches (log2 of the)
> current region size.
>
> The fourth option is to load the shift count from a data area
> associated with the current blob. In the case of an nmethod this would
> be the nmethod constants section. In the case of a generated stub this
> would have to be a dedicated memory address in its associated blob.
> Either way the data location would need to be marked with a new
> relocation. Patching of the relocation address would simply require
> copying the (log2 of the) current region size into the data area.
>
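A sketch of what the load-time fixup for option four might look like, assuming a blob is just a byte buffer plus a list of slot offsets recorded when the code was cached. The types and names are hypothetical; a real implementation would go through the relocation machinery rather than a plain vector of offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a cached blob (nmethod or stub).
struct CachedBlobSketch {
    std::vector<std::uint8_t> bytes;              // code plus data area
    std::vector<std::size_t>  grain_shift_slots;  // stand-in for relocations
};

// Load-time fixup: copy the runtime shift into every recorded data slot.
inline void patch_grain_shift(CachedBlobSketch& blob,
                              std::uint8_t runtime_shift) {
    for (std::size_t off : blob.grain_shift_slots) {
        assert(off < blob.bytes.size());
        blob.bytes[off] = runtime_shift;
    }
}

// Models the barrier reading the shift from its own blob's data slot.
inline bool blob_crosses_region(const CachedBlobSketch& blob,
                                std::size_t slot,
                                std::uint64_t store_addr,
                                std::uint64_t new_val) {
    return ((store_addr ^ new_val) >> blob.bytes[slot]) != 0;
}
```

Option three would have the same shape, except that the patch step would re-encode the shift instruction itself instead of writing a data byte.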
> I'll hold off on adding these solutions (also on implementing the x86
> versions -- well, more likely, letting Ashu provide them ;-) until I
> get some feedback on these first two. I'll also see if I can get any
> better indication of whether the performance of the first two
> solutions is an issue. I think solution one is by far the simplest,
> resolving the immediate issue with least fuss (note that I poked the
> necessary data byte into a hole in the thread record so it has no
> space implications). However, if we end up having to tweak generated
> code to deal with other config changes the alternatives may be worth
> investing in as they might scale better.
>
> regards,
>
>
> Andrew Dinn
> -----------
>