RFR: 8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GCs and JIT performance [v28]

Erik Österlund eosterlund at openjdk.org
Fri Feb 20 15:06:25 UTC 2026


On Wed, 18 Feb 2026 18:45:22 GMT, Evgeny Astigeevich <eastigeevich at openjdk.org> wrote:

>> Arm Neoverse N1 erratum 1542419: "The core might fetch a stale instruction from memory which violates the ordering of instruction fetches". It is fixed in Neoverse N1 r4p1.
>>  
>> Neoverse-N1 implementations mitigate erratum 1542419 with a workaround:
>> - Disable coherent icache.
>> - Trap IC IVAU instructions.
>> - Execute:
>>    - `tlbi vae3is, xzr`
>>    - `dsb sy`
>>  
>>  `tlbi vae3is, xzr` invalidates translations for all address spaces (global for address). The instruction is not considered complete until all memory accesses that use the old, in-scope translation information have completed.
>>  
>> As this workaround has significant overhead, Arm Neoverse N1 (MP050) Software Developer Errata Notice version 29.0 suggests:
>> 
>> "Since one TLB inner-shareable invalidation is enough to avoid this erratum, the number of injected TLB invalidations should be minimized in the trap handler to mitigate the performance impact due to this workaround."
>> 
>> This PR introduces a mechanism to defer instruction cache (ICache) invalidation for AArch64 to address the Arm Neoverse N1 erratum 1542419, which causes significant performance overhead if ICache invalidation is performed too frequently. The implementation includes detection of affected Neoverse N1 CPUs and automatic enabling of the workaround for relevant Neoverse N1 revisions.
>> 
>> Changes include:
>> 
>> * Added a new diagnostic AArch64 JVM flag `NeoverseN1Errata1542419` to enable or disable the workaround for the erratum. The flag is automatically enabled for Neoverse N1 CPUs prior to r4p1, as detected during VM initialization.
>> * Added a new diagnostic JVM flag `UseDeferredICacheInvalidation` to enable or disable deferred ICache invalidation. The flag is automatically enabled on AArch64 if the CPU supports hardware cache coherence.
>> * Introduced the `ICacheInvalidationContext` class to manage deferred ICache invalidation, with platform-specific logic for AArch64. This context is used to batch ICache invalidations, reducing performance impact.
>> * Provided a default (no-op) implementation for `DefaultICacheInvalidationContext` on platforms where the workaround is not needed, ensuring portability and minimal impact on other architectures.
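>> 
>> For illustration, the deferred-invalidation idea can be sketched as a scoped batching object. The names and structure below are hypothetical and heavily simplified, not the actual HotSpot implementation:
>> 
>> ```cpp
>> #include <cstddef>
>> #include <cstdint>
>> #include <vector>
>> 
>> // Hypothetical sketch: collect ICache invalidation requests inside a
>> // scope and flush them once at scope exit, so the expensive maintenance
>> // sequence (or, under the erratum workaround, the trapped TLBI) is paid
>> // once per batch instead of once per request.
>> class ICacheInvalidationContext {
>> public:
>>   ~ICacheInvalidationContext() { flush(); }
>> 
>>   // Record a code range instead of invalidating it immediately.
>>   void invalidate(uintptr_t start, size_t len) {
>>     _ranges.push_back({start, len});
>>   }
>> 
>>   size_t pending() const { return _ranges.size(); }
>>   size_t flushes() const { return _flush_count; }
>> 
>>   void flush() {
>>     if (_ranges.empty()) return;
>>     // On real AArch64 hardware this is where one IC IVAU/DSB/ISB
>>     // sequence covering all recorded ranges would be issued.
>>     _flush_count++;
>>     _ranges.clear();
>>   }
>> 
>> private:
>>   struct Range { uintptr_t start; size_t len; };
>>   std::vector<Range> _ranges;
>>   size_t _flush_count = 0;
>> };
>> ```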
>> 
>> **Testing results: linux fastdebug build**
>> - Neoverse-N1 (Graviton 2)
>>    - [x] tier1: passed
>>    - [x] tier2: passed
>>    - [x] tier3: passed
>>    - [x] tier4: 3 failures
>>       - `containers/docker/TestJcmdWithSideCar.java`: JDK-8341518
>>       - `com/sun/nio/sctp/SctpChannel/CloseDe...
>
> Evgeny Astigeevich has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Do fullGC when multiple threads execute test methods

> > > If the user does not set command-line options, we automatically detect whether we can use single invalidation and whether we need mitigation of the erratum.
> > 
> > 
> > So is UseSingleICacheInvalidation a win, when you don't have the erratum?
> 
> Yes, it is. Graviton 3, which does not have the N1 erratum, shows the following improvements for ZGC: [#28328 (comment)](https://github.com/openjdk/jdk/pull/28328#issuecomment-3585923078)
> 
> ```
>     2 fields accessed
>         Full GC: 13%
>         System GC: 14%
>         Young GC: 28%
>     4 fields accessed
>         Full GC: 20%
>         System GC: 20%
>         Young GC: 33%
>     8 fields accessed
>         Full GC: 26%
>         System GC: 26%
>         Young GC: 40%
> ```
> 
> I might need to rerun a microbenchmark for the master tip.
> 
> > Also, perhaps if we can check that we have reliable icache coherency, we can have a tighter nmethod entry barrier, more similar to what we have on x86, without the epoching checks that I added because we can't generally rely on instruction cache coherence being present. But that sounds like some fun for another day.
> 
> Yes, checking IDC/DIC bits is a reliable way to check for full hardware cache coherence.
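> 
> As a rough sketch (hypothetical helper names; bit positions per the Arm Architecture Reference Manual, where CTR_EL0.DIC is bit 29 and CTR_EL0.IDC is bit 28), such a check could look like:
> 
> ```cpp
> #include <cstdint>
> 
> // CTR_EL0.IDC (bit 28): data cache clean to the Point of Unification is
> // not required for instruction-to-data coherence.
> // CTR_EL0.DIC (bit 29): instruction cache invalidation to the Point of
> // Unification is not required for instruction-to-data coherence.
> constexpr unsigned CTR_EL0_IDC_SHIFT = 28;
> constexpr unsigned CTR_EL0_DIC_SHIFT = 29;
> 
> inline bool icache_fully_coherent(uint64_t ctr_el0) {
>   bool idc = (ctr_el0 >> CTR_EL0_IDC_SHIFT) & 1;
>   bool dic = (ctr_el0 >> CTR_EL0_DIC_SHIFT) & 1;
>   return idc && dic;
> }
> 
> #if defined(__aarch64__)
> // Reading CTR_EL0 is only possible on AArch64 targets.
> inline uint64_t read_ctr_el0() {
>   uint64_t v;
>   asm volatile("mrs %0, ctr_el0" : "=r"(v));
>   return v;
> }
> #endif
> ```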

Thank you for the numbers. Is this reporting the speed of running GC cycles? That's nice. But in reality, these numbers are not super meaningful for concurrent GCs unless there is urgency. What I'm more worried about is what the effect is on the application threads. It's not obvious to me that this is a winning strategy there, but maybe it is? It would be encouraging to know that at least some macro benchmarks don't regress, and not just that some micro benchmarks show improvements on GC cycle lengths (assuming that's what we are measuring).

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28328#issuecomment-3935463583


More information about the shenandoah-dev mailing list