RFR: 8345067: C2: enable implicit null checks for ZGC reads

Mon May 12 14:48:59 UTC 2025

On Tue, 6 May 2025 13:28:28 GMT, Roberto Castañeda Lozano <rcastanedalo at openjdk.org> wrote:

> Currently, C2 cannot exploit late-expanded GC memory accesses as implicit null checks because of their use of temporary operands (`MachTemp`), which prevents `PhaseCFG::implicit_null_check` from [hoisting the memory accesses to the test basic block](https://github.com/openjdk/jdk/blob/f88c1c6ff86b8f29a71647e46136b6432bb67619/src/hotspot/share/opto/lcm.cpp#L319-L335).
> 
> This changeset extends the scope of the implicit null check optimization so that it can exploit ZGC object loads. It introduces a platform-dependent predicate (`MachNode::is_late_expanded_null_check_candidate`) to mark late-expanded instructions that emit a suitable memory access as a first instruction as candidates, and extends the optimization to recognize and hoist candidate memory accesses that use temporary operands:
> 
> ![example](https://github.com/user-attachments/assets/b5f9bbc8-d75d-4cf3-841e-73db3dbae753)
> 
> ZGC object loads are marked as late-expanded null-check candidates unconditionally on all ZGC-supported platforms except on aarch64, where only loads that do not require an initial `lea` instruction (due to [address legitimization](https://github.com/openjdk/jdk/blob/ddd07b107e814ec846579a66d4f2005b7db9bb2f/src/hotspot/cpu/aarch64/macroAssembler_aarch64.hpp#L132-L144)) are marked as candidates. Fortunately, most aarch64 loads seen in practice use small offsets and can be marked as candidates.
> 
> Exploiting ZGC loads increases the effectiveness of the implicit null check optimization (percent of explicit null checks turned into implicit ones at compile time) by around 10% in the DaCapo23 benchmarks. This results in slight performance improvements (in the 1-2% range) in a few DaCapo and SPECjvm2008 benchmarks and an overall slight improvement across Renaissance benchmarks.
> 
> #### Testing
> - tier1-5, compiler stress test (linux-x64, macosx-x64, windows-x64, linux-aarch64, macosx-aarch64; release and debug mode).

Thanks for looking at this PR, Emanuel!

> It is a limitation that we require the first operation to be the memory access. But the alternative would probably be significantly more complicated, i.e. to track the location of all the memory locations.

Right, I have prototyped this alternative in the wider context of [JDK-8344627](https://bugs.openjdk.org/browse/JDK-8344627) since it would be required for using writes as implicit null checks (both in ZGC and G1), and it indeed adds some complexity to `PhaseOutput` and other places (see https://github.com/openjdk/jdk/compare/master...robcasloz:jdk:JDK-implicit-null-checks). I ran some preliminary experiments and could not see enough benefits to justify the additional complexity.

> In our offline discussion, I had some hesitation about the case where the load is at the beginning, but the barrier may have more loads. I wondered: what if the first load does not trigger the NullPointerException, but a later load then encounters the null pointer.

This cannot happen because the address we load from stays constant throughout the barrier. See, for example, the code generated for a `zLoadP` on x64 (AT&T syntax):

0x00007514c47d6aa0:  movq 0x10(%rsi), %rax    ; main OOP load with implicit exception: dispatches to 0x00007514c47d6abe
0x00007514c47d6aa4:  shrq $0xd, %rax          ; uncolor, destroys the OOP loaded in %rax
0x00007514c47d6aa8:  ja   0x36                ; jump to barrier stub (slow path)

(...)

0x00007514c47d6abe:  trigger uncommon trap (null_check)

(...)

barrier stub (slow path):
0x00007514c47d6ae4:  movq 0x10(%rsi), %rax    ; re-load OOP that was destroyed by uncoloring
(...)                                         ; call into runtime (ZBarrierSetRuntime::load_barrier_on_oop_field_preloaded(oopDesc*, oop*))
0x00007514c47d6b09:  jmp  -0x5d               ; go back to main code section

Note how the address we might fault on (triggering the implicit exception) is computed as `%rsi` (base address) + `0x10` (field offset), and neither changes between the main load and the slow-path reload. If the main load does not fault, the reload cannot fault either.
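
As a rough illustration of this argument, here is a toy model of the fast path and slow path in the listing. It is only a sketch: `Fault`, `load`, `runtime_heal`, and `z_load` are hypothetical names, memory is a plain dictionary, and the runtime call is assumed to simply return the uncolored pointer. The slow-path condition mirrors the flags that `ja` tests after `shrq` (CF = 0 and ZF = 0); what the color bits mean is not modeled.

```python
# Toy model only: hypothetical names, not HotSpot code.

FIELD_OFFSET = 0x10  # field offset used in the listing
COLOR_SHIFT = 0xD    # shift amount from `shrq $0xd` in the listing


class Fault(Exception):
    """Stands in for the hardware trap raised by a load from a null base."""


def load(memory, trace, base, offset):
    """Load from base + offset, recording the accessed address in `trace`."""
    if base == 0:
        raise Fault("implicit null check")
    address = base + offset
    trace.append(address)
    return memory[address]


def runtime_heal(raw):
    """Placeholder for the runtime call; assumed to yield the uncolored pointer."""
    return raw >> COLOR_SHIFT


def z_load(memory, trace, base):
    raw = load(memory, trace, base, FIELD_OFFSET)  # main load, may fault
    carry = (raw >> (COLOR_SHIFT - 1)) & 1         # last bit shifted out (CF after SHR)
    oop = raw >> COLOR_SHIFT                       # uncolor, destroys the loaded value
    if carry == 0 and oop != 0:                    # condition tested by `ja`
        raw = load(memory, trace, base, FIELD_OFFSET)  # slow-path reload, same address
        oop = runtime_heal(raw)
    return oop


# Usage: a colored pointer whose bit 12 is clear takes the slow path.
memory = {0x1000 + FIELD_OFFSET: 0x42 << COLOR_SHIFT}
trace = []
z_load(memory, trace, 0x1000)
print([hex(address) for address in trace])  # addresses touched by the two loads
```

In this model a null base can only fault on the main load, before anything is recorded in `trace`, and the slow-path reload always touches the address that the main load already accessed successfully.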

> I think I was also worried that we would re-load the pointer itself. Then the old pointer may be non-null, but once we load the pointer again it may be null because another thread changed the reference. But now I thought about that again: that would really violate the Java Memory Model, you cannot duplicate the load of the pointer. So I suppose rather we got the old pointer from somewhere, and then we check if that old pointer is still valid in the barrier, and if not, we somehow directly translate the old pointer to a new pointer? Is that what the oop map is used for?

I am not sure I understand the question. Could you perhaps reformulate it with a concrete example?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25066#issuecomment-2872870543