RFR: 8340313: Crash due to invalid oop in nmethod after C1 patching
Dean Long
dlong at openjdk.org
Fri Oct 11 20:00:17 UTC 2024
On Mon, 7 Oct 2024 13:39:28 GMT, Tobias Hartmann <thartmann at openjdk.org> wrote:
> C1 patching (explained in detail [here](https://github.com/openjdk/jdk/blob/212e32931cafe446d94219d6c3ffd92261984dff/src/hotspot/share/c1/c1_Runtime1.cpp#L835)) works by rewriting the [patch body](https://github.com/openjdk/jdk/blob/212e32931cafe446d94219d6c3ffd92261984dff/src/hotspot/share/c1/c1_Runtime1.cpp#L860), usually a move, and then copying it into place over top of the jmp instruction, being careful to flush caches and doing it in an MP-safe way.
>
> Now the problem is that there can be multiple patch sides in one nmethod (for example, one for each field access in `TestConcurrentPatching::test`) and multiple threads executing that nmethod can trigger patching concurrently. Although the patch body is not executed, one `Thread A` can update the oop immediate of the `mov` in the patch body:
> https://github.com/openjdk/jdk/blob/212e32931cafe446d94219d6c3ffd92261984dff/src/hotspot/share/c1/c1_Runtime1.cpp#L1185
>
> While another `Thread B` is done with patching another patch side and already walks the nmethod oops via `Universe::heap()->register_nmethod(nm)` to notify the GC. `Thread B` might then encounter a half-written oop from `Thread A` if the immediate crosses a page or cache-line boundary:
> https://github.com/openjdk/jdk/blob/212e32931cafe446d94219d6c3ffd92261984dff/src/hotspot/share/c1/c1_Runtime1.cpp#L1282
>
> In short:
> - `Thread A`: Patches location 1 in an nmethod and executes `register_nmethod()`, walking all immediate oops.
> - `Thread B`: Patches location 2 in the same nmethod and just wrote half of the oop immediate of an `mov` in the patch body because the store is not atomic.
> - `Thread A`: Crashes when walking the immediate oops of the nmethod and encountering the oop just partially written by `Thread B` concurrently.
>
> Updating the oop immediate is not atomic on x86_64 because the address of the immediate is not 8-byte aligned.
> I propose to simply align it in `PatchingStub::emit_code` to guarantee atomicity.
>
> The new regression test triggers the issue reliably for the `load_mirror_patching_id` case but unfortunately, I was not able to trigger the `load_appendix_patching_id` case which should be affected as well:
> https://github.com/openjdk/jdk/blob/212e32931cafe446d94219d6c3ffd92261984dff/src/hotspot/share/c1/c1_Runtime1.cpp#L1197
>
> I still added a corresponding test case `testIndy` and a [new assert that checks proper alignment](https://github.com/openjdk/jdk/commit/f418bc01c946b4c76f4bceac1ad503dabe182df7) triggers in both scenarios. I had to remo...
Sorry, I'm still not convinced this is safe. I took another look at C1 patching, and not only are we trying to scan oops while they are being patched, we are also patching the reloc information at the same time (see the call to change_reloc_info_for_address()). So that means there is a window where the instruction is patched but the reloc information is stale.
If the scope of the lock is the only issue, then we could try to address that with a finer-grained lock or even a per-nmethod lock.
@tschatzl , when we call register_nmethod(), do we really need to scan the oops immediately, or could that be delayed until the next safepoint?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/21389#issuecomment-2408043697
More information about the hotspot-compiler-dev
mailing list