RFR: 8353558: x86: Use CLFLUSHOPT/CLWB/CPUID for ICache sync [v2]
Andrew Dinn
adinn at openjdk.org
Tue Apr 15 12:43:47 UTC 2025
On Tue, 15 Apr 2025 10:58:36 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> For Leyden, that wants to load a lot of code as fast as it can, code cache flush costs are now significant part of the picture. There are single-digit percent startup time opportunities in better ICache syncs.
>>
>> It is not sufficiently clear why icache flushes are needed for x86. Intel/AMD manuals say the instruction caches are fully coherent. GCC intrinsic for `__builtin___clear_cache` is empty. It looks that a single serializing instruction like `cpuid` might be OK for the entire flush to happen, this is what our `OrderAccess::cross_modify_fence` does. Still, we can maintain the old behavior by flushing the caches smarter: there are CLFLUSHOPT and CLWB available on modern x86.
>>
>> See more discussion and references in the RFE. The performance data is in the comments in this PR.
>>
>> Additional testing:
>> - [x] Linux x86_64 server fastdebug, `all`
>> - [x] Linux x86_64 server fastdebug, `all` + `X86ICacheSync={0,1,2,3,4}`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains one commit:
>
> Fix
Looks good.
I also strongly suspect that a flush is not needed to achieve DCache <-> ICache coherence. The CLFLUSH/CLFLUSHOPT and CLWB instructions are really about pushing writes down through the cache hierarchy to(wards) memory. It certainly does appear from the documentation that x86 ought to enforce D/ICache coherence no matter where the data sits in cache or memory, and the implementations of `__builtin___clear_cache` and `cross_modify_fence` strongly reinforce that. I'll also note that although we saw a repeatable problem on AArch64 when Ashu's draft single-copy patch installed AOT adapters without a flush, we did not see a corresponding problem on x86.
The important constraint that needs to be met here is that no thread tries to jump into the code at/following the method entry address before the instruction writes are visible. Since the update of the compiled entry address on the method follows the writing of the code *in the same thread*, I believe we should get dcache <-> icache coherence automatically from the fact that x86 is a TSO CPU.
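To make the ordering argument concrete, here is a minimal C++ sketch of the "write the code, then publish the entry address" pattern. `MethodStub`, `install_code` and `lookup_entry` are hypothetical names for illustration only, not HotSpot's actual types or functions, and the sketch models only the store ordering on the data side; it says nothing by itself about instruction fetch.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a method's metadata (not HotSpot's Method class).
struct MethodStub {
    std::atomic<const std::uint8_t*> compiled_entry{nullptr};
};

// Writer: copy the instruction bytes first, then publish the entry address.
// The release store stops the compiler from moving the copy after the publish.
// On x86 it is an ordinary store, because TSO already keeps stores in order.
inline void install_code(MethodStub& m, std::uint8_t* dst,
                         const std::uint8_t* src, std::size_t len) {
    std::memcpy(dst, src, len);
    m.compiled_entry.store(dst, std::memory_order_release);
}

// Reader: sees nullptr until the entry has been published. A non-null result
// implies the bytes written before the release store are visible as data.
inline const std::uint8_t* lookup_entry(const MethodStub& m) {
    return m.compiled_entry.load(std::memory_order_acquire);
}
```

A thread that only ever reaches the code through `lookup_entry` cannot observe the entry address before the bytes behind it, which is the constraint described above.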
That said, it's maybe best for now to be conservative and retain the flushes using the most performant available writeback instruction but provide an override to use all other available options -- as the current patch does. That way users can experiment with a safe fallback. We can switch the default to not flush when we feel more confident it is ok.
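As a rough illustration of that "conservative default plus override" policy, a selection routine might look like the sketch below. All names here (`SyncKind`, `CpuFeatures`, `choose_kind`) are hypothetical, the preference order CLWB > CLFLUSHOPT > CLFLUSH is my assumption, and the sketch does not model the actual `X86ICacheSync` values used by the patch.

```cpp
// Hypothetical policy kinds; not the patch's X86ICacheSync encoding.
enum class SyncKind { NoFlush, Clflush, Clflushopt, Clwb };

// Hypothetical feature summary, e.g. filled in from CPUID at startup.
struct CpuFeatures {
    bool has_clflushopt;
    bool has_clwb;
};

// CLFLUSH and "no flush" are assumed to be usable everywhere.
inline bool supported(SyncKind k, const CpuFeatures& f) {
    switch (k) {
        case SyncKind::Clflushopt: return f.has_clflushopt;
        case SyncKind::Clwb:       return f.has_clwb;
        default:                   return true;
    }
}

// Conservative default: keep flushing, using the cheapest available
// instruction (assumed order: CLWB, then CLFLUSHOPT, then CLFLUSH).
inline SyncKind default_kind(const CpuFeatures& f) {
    if (f.has_clwb)       return SyncKind::Clwb;
    if (f.has_clflushopt) return SyncKind::Clflushopt;
    return SyncKind::Clflush;
}

// A user override wins when the CPU supports it (nullptr means no override);
// an unsupported override falls back to the default.
inline SyncKind choose_kind(const CpuFeatures& f, const SyncKind* override_kind) {
    if (override_kind != nullptr && supported(*override_kind, f)) {
        return *override_kind;
    }
    return default_kind(f);
}
```

With this shape, switching the default to "no flush" later would be a one-line change in `default_kind`, while the override path stays available for experiments.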
-------------
Marked as reviewed by adinn (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/24389#pullrequestreview-2768102420
PR Comment: https://git.openjdk.org/jdk/pull/24389#issuecomment-2804914208