RFR: 8310239: Add missing cross modifying fence in nmethod entry barriers [v2]

Mon Aug 28 11:19:13 UTC 2023

On Tue, 8 Aug 2023 22:28:09 GMT, Dean Long <dlong at openjdk.org> wrote:

> At first glance, this seems like an optimization to me, and not a correctness issue. At worst, we continue to execute the stale compare and go through the slow path unnecessarily. I don't think we are violating AMD APM, at least for the compare for the entry barrier. However, there may be a problem if what that barrier is protecting, the nmethod code, was patched in a way that is not concurrent-execution-safe. If that is the case, or just to be safe, then executing a serializing instruction seems like a good idea. But we don't do that for all threads. For any thread that sees the patched instruction, we will just take the fast path and not executed the serializing instruction. Does visibility of the patched entry point instruction as code imply all patching to the nmethod is also visible, as both data and instructions? I'm not convinced it does.

Unfortunately, it is not just an optimization. For example, ZGC has always patched code oop relocations in the nmethod entry barriers on x86_64, which is guarded by the nmethod entry barrier. The assumption is that if the nmethod immediate oops are patched first, and the guard value (immediate of the cmp instruction) is patched after, then if a thread sees the new cmp instruction, it will also see the new oop immediates. And that is indeed what the "asynchronous" cross modifying code description ensures will work in the AMD APM. So that all checks out.
What doesn't check out though unfortunately, is when the opposite happens and we see the old compare value, and we take a slow path. After the slow path has executed, we don't re-execute the guarding cmp instruction. Instead we enter the nmethod, assuming that surely that should be safe after taking the slow path. But as the cmp isn't re-executed, this slow path code only uses *data* to communicate that the guard value has been updated. That is precisely what is described as "synchronous" cross modifying code in the AMD APM. Another thread could have *just* flipped that data to look fine concurrently. We load some the data that signals that the code has been updated, and then need to either fall back and re-execute the cmp to guard the code with the asynchronous behaviour, or issue a cross-modifying code fence, to ensure that the patched oop immediates become visible.
In summary, the fast-path is asynchronous cross-modifying code, and it's fine without the fence. But the slow path uses synchronous cross-modifying code, but without correctly using cross modification fence, which is required for correctness.
Now, this problem becomes bigger with generational ZGC, because we patch more things than just oop immediates, but the principle and problem is precisely the same.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14543#issuecomment-1695511290