RFR: 8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance [v28]
Aleksey Shipilev
shade at openjdk.org
Thu Feb 19 16:34:25 UTC 2026
On Wed, 18 Feb 2026 18:45:22 GMT, Evgeny Astigeevich <eastigeevich at openjdk.org> wrote:
>> Arm Neoverse N1 erratum 1542419: "The core might fetch a stale instruction from memory which violates the ordering of instruction fetches". It is fixed in Neoverse N1 r4p1.
>>
>> Neoverse-N1 implementations mitigate erratum 1542419 with a workaround:
>> - Disable coherent icache.
>> - Trap IC IVAU instructions.
>> - Execute:
>> - `tlbi vae3is, xzr`
>> - `dsb sy`
>>
>> `tlbi vae3is, xzr` invalidates translations for all address spaces (global for address). It waits for all memory accesses using in-scope old translation information to complete before it is considered complete.
>>
>> As this workaround has significant overhead, Arm Neoverse N1 (MP050) Software Developer Errata Notice version 29.0 suggests:
>>
>> "Since one TLB inner-shareable invalidation is enough to avoid this erratum, the number of injected TLB invalidations should be minimized in the trap handler to mitigate the performance impact due to this workaround."
>>
>> This PR introduces a mechanism to defer instruction cache (ICache) invalidation for AArch64 to address the Arm Neoverse N1 erratum 1542419, which causes significant performance overhead if ICache invalidation is performed too frequently. The implementation includes detection of affected Neoverse N1 CPUs and automatic enabling of the workaround for relevant Neoverse N1 revisions.
>>
>> Changes include:
>>
>> * Added a new diagnostic AArch64 JVM flag `NeoverseN1Errata1542419` to enable or disable the workaround for the erratum. The flag is automatically enabled for Neoverse N1 CPUs prior to r4p1, as detected during VM initialization.
>> * Added a new diagnostic JVM flag `UseDeferredICacheInvalidation` to enable or disable deferred ICache invalidation. The flag is automatically enabled for AArch64 if the CPU supports hardware cache coherence.
>> * Introduced the `ICacheInvalidationContext` class to manage deferred ICache invalidation, with platform-specific logic for AArch64. This context is used to batch ICache invalidations, reducing performance impact.
>> * Provided a default (no-op) implementation for `DefaultICacheInvalidationContext` on platforms where the workaround is not needed, ensuring portability and minimal impact on other architectures.
>>
>> **Testing results: linux fastdebug build**
>> - Neoverse-N1 (Graviton 2)
>> - [x] tier1: passed
>> - [x] tier2: passed
>> - [x] tier3: passed
>> - [x] tier4: 3 failures
>> - `containers/docker/TestJcmdWithSideCar.java`: JDK-8341518
>> - `com/sun/nio/sctp/SctpChannel/CloseDe...
>
> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision:
>
> Do fullGC when multiple threads execute test methods
Took a brief look, and I already have questions. Feels like we need to look at simplifications.
Overall, I do not completely understand the dependencies between `UseDeferredICacheInvalidation`, `NeoverseN1...`, DIC/IDC features. Is there a way to simplify this? For example, always opt into deferred invalidation, and then let deferred invalidation code mitigate the errata? I.e. if you don't defer, you get no mitigation, you eat the cost of `__builtin_clear_cache` all the time.
src/hotspot/cpu/aarch64/globals_aarch64.hpp line 130:
> 128: product(bool, AlwaysMergeDMB, true, DIAGNOSTIC, \
> 129: "Always merge DMB instructions in code emission") \
> 130: product(bool, NeoverseN1Errata1542419, false, DIAGNOSTIC, \
Sounds like a mouthful with these numbers? We have a mitigation for Intel like that, see `IntelJccErratumMitigation`, so IMO it makes sense to match that. For example, `NeoverseICacheErratumMitigation`? This also captures the intent: we are _mitigating_ the erratum.
src/hotspot/share/code/nmethod.cpp line 2084:
> 2082: }
> 2083:
> 2084: void nmethod::fix_non_immediate_oop_relocations(ICacheInvalidationContext* icic) {
So this almost duplicates `fix_all_oop_relocations()`, correct? What do we gain from doing so? The old code has boolean flag that gates processing immediates or not, it sounds less error-prone to keep it that way?
src/hotspot/share/code/nmethod.cpp line 2099:
> 2097: } else if (iter.type() == relocInfo::metadata_type) {
> 2098: metadata_Relocation* reloc = iter.metadata_reloc();
> 2099: modified_inst = reloc->fix_metadata_relocation();
Sounds like `modified_inst` is just collecting for the sake of flipping `modified_code` to `true`? Why not just `modified_code |= ...` all these uses?
src/hotspot/share/code/relocInfo.cpp line 621:
> 619:
> 620:
> 621: bool metadata_Relocation::fix_metadata_relocation() {
I understand we return the status here, so that we avoid invalidation when no real patching work was done. Granted, it likely matches the behavior we have. But I wonder how much this buys us? If we are doing the deferred invalidation in a very broad scope, it stands to reason we would _almost definitely_ have to invalidate, and all this tracking would _nearly always_ give us the same answer, "Do invalidate"?
IOW, this might be an unnecessary complication of the interface.
_Dropping_ this change would also be more robust, in case something somewhere _forgets_ to announce the code cache was changed. If we don't trust the downstream code about this and just summarily invalidate, it feels safer.
src/hotspot/share/gc/g1/g1NMethodClosure.cpp line 90:
> 88: }
> 89:
> 90: nm->fix_non_immediate_oop_relocations();
I don't quite understand this replacement. Why can't/shouldn't we call `fix_all_oop_relocations()` here, so we get to the same code? I understand Shenandoah uses `_has_non_immed_oops` in this check, but G1 does not have it.
To phrase it differently, what do we _lose_ by going into `fix_all_oop_relocations()`?
src/hotspot/share/runtime/icache.hpp line 74:
> 72: };
> 73:
> 74:
Superfluous.
test/hotspot/jtreg/gc/TestDeferredICacheInvalidation.java line 221:
> 219: WB.enqueueMethodForCompilation(m, compLevel);
> 220: while (WB.isMethodQueuedForCompilation(m)) {
> 221: Thread.onSpinWait();
Burns CPU for no reason, surely the compiler would not respond in microseconds. Just do a small sleep?
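Something like this (self-contained sketch; `queued()` is a hypothetical stand-in for `WB.isMethodQueuedForCompilation(m)`, which needs the jtreg WhiteBox setup):

```java
// Poll with a small sleep instead of busy-spinning: compilation takes
// milliseconds at best, so Thread.onSpinWait() in a tight loop just
// burns a core while waiting.
public class PollSleep {
    private static int polls = 0;

    // Hypothetical stand-in: pretend the method leaves the compile
    // queue after a few polls.
    private static boolean queued() {
        return ++polls < 3;
    }

    public static void main(String[] args) throws InterruptedException {
        while (queued()) {
            Thread.sleep(10);  // small sleep instead of spinning
        }
        System.out.println("polls=" + polls);
    }
}
```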
-------------
PR Review: https://git.openjdk.org/jdk/pull/28328#pullrequestreview-3826772880
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828657373
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828753248
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828671168
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828774643
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828742009
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828778374
PR Review Comment: https://git.openjdk.org/jdk/pull/28328#discussion_r2828786123