RFR: 8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GenZGC performance [v3]

Thu Nov 27 13:41:55 UTC 2025

On Tue, 25 Nov 2025 13:04:55 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> Yeah patching all nmethods as one unit is basically equivalent to making the code cache processing a STW operation. Last time we processed the code cache STW was JDK 11. A dark place I don't want to go back to. It can get pretty big and mess up latency. So I'm in favour of limiting the fix and not re-introduce STW code cache processing.
>> 
>> Otherwise yes you are correct; we perform synchronous cross modifying code with no assumptions about instruction cache coherency because we didn't trust it would actually work for all ARM implementations. Seems like that was a good bet. We rely on it on x64 still though.
>> 
>> It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so, a horrible alternative might be to use mprotect to temporarily remove execution permission on the affected per nmethod pages, and detect over shooting in the signal handler, resuming execution when execution privileges are then restored immediately after. That should limit the affected VA to close to what is actually invalidated. But it would look horrible.
>
>> It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so,
> 
> "Because the cache-maintenance wasn't needed, we can do the TLBI instead.
> In fact, the I-Cache line-size isn't relevant anymore, we can reduce
> the number of traps by producing a fake value.
> 
> "For user-space, the kernel's work is now to trap CTR_EL0 to hide DIC,
> and produce a fake IminLine. EL3 traps the now-necessary I-Cache
> maintenance and performs the inner-shareable-TLBI that makes everything
> better."
> 
> My interpretation of this is that we only need to do the synchronization dance once, at the end of the patching. But I guess we don't know exactly if we have an affected core or if the kernel workaround is in action.

@theRealAph @fisk 
As we have explicit synchronization for the patched code, I decided to run an experiment with deferred icache invalidation on Graviton 3 (Neoverse V1).
Graviton 3 does not have the Neoverse N1 bug. It has hardware dcache and icache coherence. Such full hardware coherence means all `ICache:invalidate` operations are just a bunch of:

dsb ish
isb

From my experience of implementing spin pauses, we use `isb` for pauses. So our multiple `ICache:invalidate` calls are effectively a bunch of pauses.
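
To make the cost argument concrete, here is a toy Java model of the two strategies. It is not HotSpot code and does not touch any real cache: it only counts how many `dsb ish; isb` pairs each strategy would issue. The class and method names are hypothetical, and the flush granularity (one flush per batch of patches) is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: counts barrier pairs, performs no real cache maintenance.
final class DeferredInvalidationModel {
    private int barrierPairs = 0;
    private final List<Long> pending = new ArrayList<>();

    // Eager strategy: every patched site pays for one "dsb ish; isb" pair.
    void patchEager(long address) {
        barrierPairs++;
    }

    // Deferred strategy: remember the patched site, issue no barrier yet.
    void patchDeferred(long address) {
        pending.add(address);
    }

    // One barrier pair covers all patches recorded since the last flush.
    void flushDeferred() {
        if (!pending.isEmpty()) {
            barrierPairs++;
            pending.clear();
        }
    }

    int barrierPairs() {
        return barrierPairs;
    }

    public static void main(String[] args) {
        DeferredInvalidationModel eager = new DeferredInvalidationModel();
        DeferredInvalidationModel deferred = new DeferredInvalidationModel();
        for (long site = 0; site < 8; site++) {
            eager.patchEager(site);
            deferred.patchDeferred(site);
        }
        deferred.flushDeferred();
        System.out.println("eager barrier pairs:    " + eager.barrierPairs());
        System.out.println("deferred barrier pairs: " + deferred.barrierPairs());
    }
}
```

In this model the eager cost grows with the number of patched sites while the deferred cost stays at one pair per batch, which matches the trend in the numbers below as `accessedFieldCount` grows.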

Without deferred icache invalidation (baseline):

Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   41.290 ±  7.596  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3   95.773 ±  6.059  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  137.183 ± 12.896  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  219.030 ± 19.101  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3   43.762 ±  3.818  ms/op
GCPatchingNmethodCost.systemGC                     2           5000  avgt    3   97.525 ±  8.434  ms/op
GCPatchingNmethodCost.systemGC                     4           5000  avgt    3  139.555 ± 17.159  ms/op
GCPatchingNmethodCost.systemGC                     8           5000  avgt    3  221.163 ±  8.908  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3    3.052 ±  2.823  ms/op
GCPatchingNmethodCost.youngGC                      2           5000  avgt    3   13.956 ±  1.984  ms/op
GCPatchingNmethodCost.youngGC                      4           5000  avgt    3   22.364 ±  0.626  ms/op
GCPatchingNmethodCost.youngGC                      8           5000  avgt    3   39.821 ±  0.241  ms/op

With deferred icache invalidation:

Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   41.212 ± 10.914  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3   83.059 ± 17.115  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  110.061 ±  2.642  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  161.202 ±  5.750  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3   44.061 ±  7.586  ms/op
GCPatchingNmethodCost.systemGC                     2           5000  avgt    3   84.262 ± 11.852  ms/op
GCPatchingNmethodCost.systemGC                     4           5000  avgt    3  112.317 ±  3.907  ms/op
GCPatchingNmethodCost.systemGC                     8           5000  avgt    3  163.684 ±  9.732  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3    2.949 ±  0.626  ms/op
GCPatchingNmethodCost.youngGC                      2           5000  avgt    3    9.997 ±  1.334  ms/op
GCPatchingNmethodCost.youngGC                      4           5000  avgt    3   14.953 ±  1.121  ms/op
GCPatchingNmethodCost.youngGC                      8           5000  avgt    3   23.966 ±  1.656  ms/op

Improvements (reduction in average ms/op relative to the baseline):
- 2 fields accessed
   - Full GC: 13%
   - System GC: 14%
   - Young GC: 28% 
- 4 fields accessed
   - Full GC: 20%
   - System GC: 20%
   - Young GC: 33% 
- 8 fields accessed
   - Full GC: 26%
   - System GC: 26%
   - Young GC: 40%
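
For reference, the percentages above can be reproduced from the two tables with a few lines of Java. This is a minimal sketch: the class and method names are made up for illustration, and the scores are copied from the JMH output above (error bars are ignored).

```java
// Recomputes the improvement percentages from the avgt scores (ms/op) above.
final class GcPatchingImprovement {

    // Reduction of `deferred` relative to `baseline`, rounded to a whole percent.
    static long improvementPercent(double baseline, double deferred) {
        return Math.round((baseline - deferred) / baseline * 100.0);
    }

    public static void main(String[] args) {
        String[] labels = {"fullGC", "systemGC", "youngGC"};
        int[] fields = {2, 4, 8};
        // scores[gcKind][fieldIndex] = {baseline, deferred}
        double[][][] scores = {
            {{95.773, 83.059}, {137.183, 110.061}, {219.030, 161.202}},
            {{97.525, 84.262}, {139.555, 112.317}, {221.163, 163.684}},
            {{13.956, 9.997}, {22.364, 14.953}, {39.821, 23.966}},
        };
        for (int f = 0; f < fields.length; f++) {
            for (int g = 0; g < labels.length; g++) {
                double[] pair = scores[g][f];
                System.out.printf("%d fields, %s: %d%%%n",
                        fields[f], labels[g], improvementPercent(pair[0], pair[1]));
            }
        }
    }
}
```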

-------------

PR Comment: https://git.openjdk.org/jdk/pull/28328#issuecomment-3585923078