RFR: 8357258: x86: Improve receiver type profiling reliability [v5]
Aleksey Shipilev
shade at openjdk.org
Mon Dec 1 13:04:10 UTC 2025
On Wed, 26 Nov 2025 15:55:38 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:
>> See the bug for a discussion of the issues the current machinery has.
>>
>> This PR executes the plan outlined in the bug:
>> 1. Common the receiver type profiling code in interpreter and C1
>> 2. Rewrite receiver type profiling code to only do atomic receiver slot installations
>> 3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed
>>
>> This PR does _not_ make the counter updates themselves atomic, as that may have much wider performance implications, including regressions. This PR should be at least performance neutral.
>>
>> Additional testing:
>> - [x] Linux x86_64 server fastdebug, `compiler/`
>> - [x] Linux x86_64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
>
> - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls
> - Tighten up some more
> - Offset is always rscratch1, no need to save it
> - Grossly simplify register shuffling
> - More asserts
> - More comment touchups
> - Inline code comments
> - Mention the updater in ReceiverTypeData
> - type_profile -> profile_receiver_type
> - Stylistic: remove redundant assert
> - ... and 5 more: https://git.openjdk.org/jdk/compare/c028369d...c441209a
Oh, all right! This made me realize we actually have a secondary "fast" case: the receiver is not found, but the profile is full. This is pretty frequent with `TypeProfileWidth=2`. In that case, we are doing way too much work, anticipating a receiver slot installation that will never actually come. Specializing for that case takes significantly fewer loads and gets the code much better pipelined; I suspect that is because tight loops that _do not_ have CASes in them are uop-cached more readily.
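For readers who want the shape of that secondary fast case, here is a rough Java model. It is only a sketch, not the x86 assembly from the PR: the class `ReceiverProfileSketch`, its fields, and its two-pass layout are hypothetical names and choices made for illustration. It assumes a fixed number of receiver slots, non-atomic counters (as the PR description says), and a CAS used only to claim an empty slot.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

/** Hypothetical model of a receiver type profile; not HotSpot code. */
final class ReceiverProfileSketch {
    // Receiver slots; null means "empty". Only slot installation is atomic.
    private final AtomicReferenceArray<Class<?>> receivers;
    // Per-slot counters, deliberately updated non-atomically.
    private final long[] counts;
    // Bumped when the receiver matches no slot and no slot can be claimed.
    private long polymorphicCount;

    ReceiverProfileSketch(int width) {
        receivers = new AtomicReferenceArray<>(width);
        counts = new long[width];
    }

    void profile(Class<?> receiver) {
        int width = receivers.length();
        boolean sawEmpty = false;

        // Pass 1: loads only. Primary fast case: the receiver is already installed.
        for (int i = 0; i < width; i++) {
            Class<?> k = receivers.get(i);
            if (k == receiver) { counts[i]++; return; }
            if (k == null) sawEmpty = true;
        }

        // Secondary fast case: receiver not found and the profile is full.
        // No installation can happen, so skip the CAS path entirely.
        if (!sawEmpty) { polymorphicCount++; return; }

        // Slow path: try to claim an empty slot with a CAS.
        for (int i = 0; i < width; i++) {
            Class<?> k = receivers.get(i);
            if (k == null) {
                if (receivers.compareAndSet(i, null, receiver)) { counts[i]++; return; }
                k = receivers.get(i); // lost the race; see who won this slot
            }
            if (k == receiver) { counts[i]++; return; }
        }
        polymorphicCount++; // every slot was taken by other types in the meantime
    }

    Class<?> receiverAt(int i) { return receivers.get(i); }
    long countAt(int i) { return counts[i]; }
    long polymorphicCount() { return polymorphicCount; }
}
```

With a width of 2, profiling `String`, `Integer`, `String`, `Long`, `Long` in that order installs the first two types, counts the second `String` through the load-only pass, and sends both `Long` calls through the full-profile exit without ever reaching the CAS loop.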
We now lose "only" 0.5ns in this test:
Benchmark                                              (randomized)  Mode  Cnt    Score    Error  Units

# Baseline
InterfaceCalls.test2ndInt5Types                               false  avgt   12   16.945 ±  0.079  ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses         false  avgt    3    0.076 ±  2.187   #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads               false  avgt    3   88.738 ±  0.416   #/op
InterfaceCalls.test2ndInt5Types:branch-misses                 false  avgt    3    0.007 ±  0.003   #/op
InterfaceCalls.test2ndInt5Types:branches                      false  avgt    3   49.122 ±  0.353   #/op
InterfaceCalls.test2ndInt5Types:cycles                        false  avgt    3   57.147 ±  1.698   #/op
InterfaceCalls.test2ndInt5Types:instructions                  false  avgt    3  247.443 ±  1.531   #/op

# Old PR version
InterfaceCalls.test2ndInt5Types                               false  avgt   12   22.513 ±  0.208  ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses         false  avgt    3    0.012 ±  0.072   #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads               false  avgt    3  108.446 ± 13.975   #/op  ; +20 loads
InterfaceCalls.test2ndInt5Types:branch-misses                 false  avgt    3    0.407 ±  0.010   #/op
InterfaceCalls.test2ndInt5Types:branches                      false  avgt    3   54.102 ±  0.403   #/op  ; +5 branches
InterfaceCalls.test2ndInt5Types:cycles                        false  avgt    3   75.938 ±  5.043   #/op  ; +19 cycles
InterfaceCalls.test2ndInt5Types:instructions                  false  avgt    3  280.194 ±  5.758   #/op  ; +32 instructions

# New PR version
InterfaceCalls.test2ndInt5Types                               false  avgt   12   17.441 ±  0.287  ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses         false  avgt    3    0.009 ±  0.072   #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads               false  avgt    3   88.803 ±  1.401   #/op
InterfaceCalls.test2ndInt5Types:branch-misses                 false  avgt    3    0.009 ±  0.062   #/op
InterfaceCalls.test2ndInt5Types:branches                      false  avgt    3   52.945 ±  0.752   #/op  ; +4 branches
InterfaceCalls.test2ndInt5Types:cycles                        false  avgt    3   58.866 ± 15.379   #/op  ; +2 cycles
InterfaceCalls.test2ndInt5Types:instructions                  false  avgt    3  272.838 ±  1.665   #/op  ; +28 instructions

The code is in new commits, passes `hotspot:tier1`, running more tests now.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/25305#issuecomment-3596428656