RFR: 8370947: Mitigate Neoverse-N1 erratum 1542419 negative impact on GenZGC performance [v3]
Evgeny Astigeevich
eastigeevich at openjdk.org
Thu Nov 27 13:41:55 UTC 2025
On Tue, 25 Nov 2025 13:04:55 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> Yeah patching all nmethods as one unit is basically equivalent to making the code cache processing a STW operation. Last time we processed the code cache STW was JDK 11. A dark place I don't want to go back to. It can get pretty big and mess up latency. So I'm in favour of limiting the fix and not re-introduce STW code cache processing.
>>
>> Otherwise yes you are correct; we perform synchronous cross modifying code with no assumptions about instruction cache coherency because we didn't trust it would actually work for all ARM implementations. Seems like that was a good bet. We rely on it on x64 still though.
>>
>> It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so, a horrible alternative might be to use mprotect to temporarily remove execution permission on the affected per nmethod pages, and detect over shooting in the signal handler, resuming execution when execution privileges are then restored immediately after. That should limit the affected VA to close to what is actually invalidated. But it would look horrible.
>
>> It's a bit surprising to me if they invalidate all TLB entries, effectively ripping out the entire virtual address space, even when a range is passed in. If so,
>
> "Because the cache-maintenance wasn't needed, we can do the TLBI instead.
> In fact, the I-Cache line-size isn't relevant anymore, we can reduce
> the number of traps by producing a fake value.
>
> "For user-space, the kernel's work is now to trap CTR_EL0 to hide DIC,
> and produce a fake IminLine. EL3 traps the now-necessary I-Cache
> maintenance and performs the inner-shareable-TLBI that makes everything
> better."
>
> My interpretation of this is that we only need to do the synchronization dance once, at the end of the patching. But I guess we don't know exactly if we have an affected core or if the kernel workaround is in action.
@theRealAph @fisk
As we have explicit synchronization for the patched code, I decided to run an experiment with deferred icache invalidation on Graviton 3 (Neoverse V1).
Graviton 3 does not have the Neoverse N1 bug. It has hardware dcache and icache coherence. With full hardware coherence, each `ICache:invalidate` operation is just:
    dsb ish
    isb
From my experience implementing spin pauses, we use `isb` for pauses. So our multiple `ICache:invalidate` calls are effectively a series of pauses.
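To make the difference concrete, here is a minimal sketch of eager versus deferred invalidation. It is not the HotSpot code: `PatchStats`, `invalidate_range`, `patch_eager` and `patch_deferred` are hypothetical names, and `__builtin___clear_cache` (a GCC/Clang builtin) stands in for the real cache maintenance. It only counts how many times the maintenance sequence is issued and does not model the cross-thread synchronization mentioned above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical counter: how many times the cache-maintenance sequence ran.
struct PatchStats {
    std::size_t invalidations = 0;
};

// Stand-in for an icache invalidation over [begin, end).
// On a core with full hardware coherence this would be little more than
// "dsb ish; isb"; here we call the compiler builtin and count the call.
inline void invalidate_range(char* begin, char* end, PatchStats& stats) {
    __builtin___clear_cache(begin, end);
    ++stats.invalidations;
}

// Eager: issue one invalidation after every patched 32-bit slot.
inline PatchStats patch_eager(std::vector<std::uint32_t>& code,
                              const std::vector<std::size_t>& sites,
                              std::uint32_t value) {
    PatchStats stats;
    for (std::size_t i : sites) {
        code[i] = value;
        char* p = reinterpret_cast<char*>(&code[i]);
        invalidate_range(p, p + sizeof(std::uint32_t), stats);
    }
    return stats;
}

// Deferred: patch every slot first, then invalidate the whole buffer once.
inline PatchStats patch_deferred(std::vector<std::uint32_t>& code,
                                 const std::vector<std::size_t>& sites,
                                 std::uint32_t value) {
    PatchStats stats;
    for (std::size_t i : sites) {
        code[i] = value;
    }
    if (!sites.empty()) {
        char* begin = reinterpret_cast<char*>(code.data());
        invalidate_range(begin, begin + code.size() * sizeof(std::uint32_t), stats);
    }
    return stats;
}
```

With four patch sites, the eager variant issues four invalidations and the deferred variant issues one, while both leave the buffer with identical contents.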
Without deferred icache invalidation (baseline):
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   41.290 ±  7.596  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3   95.773 ±  6.059  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  137.183 ± 12.896  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  219.030 ± 19.101  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3   43.762 ±  3.818  ms/op
GCPatchingNmethodCost.systemGC                     2           5000  avgt    3   97.525 ±  8.434  ms/op
GCPatchingNmethodCost.systemGC                     4           5000  avgt    3  139.555 ± 17.159  ms/op
GCPatchingNmethodCost.systemGC                     8           5000  avgt    3  221.163 ±  8.908  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3    3.052 ±  2.823  ms/op
GCPatchingNmethodCost.youngGC                      2           5000  avgt    3   13.956 ±  1.984  ms/op
GCPatchingNmethodCost.youngGC                      4           5000  avgt    3   22.364 ±  0.626  ms/op
GCPatchingNmethodCost.youngGC                      8           5000  avgt    3   39.821 ±  0.241  ms/op
With deferred icache invalidation:
Benchmark                       (accessedFieldCount)  (methodCount)  Mode  Cnt    Score    Error  Units
GCPatchingNmethodCost.fullGC                       0           5000  avgt    3   41.212 ± 10.914  ms/op
GCPatchingNmethodCost.fullGC                       2           5000  avgt    3   83.059 ± 17.115  ms/op
GCPatchingNmethodCost.fullGC                       4           5000  avgt    3  110.061 ±  2.642  ms/op
GCPatchingNmethodCost.fullGC                       8           5000  avgt    3  161.202 ±  5.750  ms/op
GCPatchingNmethodCost.systemGC                     0           5000  avgt    3   44.061 ±  7.586  ms/op
GCPatchingNmethodCost.systemGC                     2           5000  avgt    3   84.262 ± 11.852  ms/op
GCPatchingNmethodCost.systemGC                     4           5000  avgt    3  112.317 ±  3.907  ms/op
GCPatchingNmethodCost.systemGC                     8           5000  avgt    3  163.684 ±  9.732  ms/op
GCPatchingNmethodCost.youngGC                      0           5000  avgt    3    2.949 ±  0.626  ms/op
GCPatchingNmethodCost.youngGC                      2           5000  avgt    3    9.997 ±  1.334  ms/op
GCPatchingNmethodCost.youngGC                      4           5000  avgt    3   14.953 ±  1.121  ms/op
GCPatchingNmethodCost.youngGC                      8           5000  avgt    3   23.966 ±  1.656  ms/op
Improvements:
- 2 fields accessed
  - Full GC: 13%
  - System GC: 14%
  - Young GC: 28%
- 4 fields accessed
  - Full GC: 20%
  - System GC: 20%
  - Young GC: 33%
- 8 fields accessed
  - Full GC: 26%
  - System GC: 26%
  - Young GC: 40%
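
For reference, the percentages above are consistent with the relative reduction in average time, (baseline - deferred) / baseline, rounded to the nearest whole percent. A small sketch of that arithmetic follows; the function names are illustrative and not part of the benchmark.

```cpp
#include <cmath>

// Relative improvement of the deferred run over the baseline, in percent.
// Both arguments are average times in ms/op taken from the tables above.
inline double improvement_percent(double baseline_ms, double deferred_ms) {
    return (baseline_ms - deferred_ms) / baseline_ms * 100.0;
}

// Same value rounded to the nearest whole percent, as listed above.
inline long rounded_improvement(double baseline_ms, double deferred_ms) {
    return std::lround(improvement_percent(baseline_ms, deferred_ms));
}
```

For example, `rounded_improvement(95.773, 83.059)` gives 13 for Full GC with 2 fields accessed, and `rounded_improvement(39.821, 23.966)` gives 40 for Young GC with 8 fields accessed.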
-------------
PR Comment: https://git.openjdk.org/jdk/pull/28328#issuecomment-3585923078