RFR: 8357258: x86: Improve receiver type profiling reliability [v5]

Aleksey Shipilev shade at openjdk.org
Mon Dec 1 13:04:10 UTC 2025


On Wed, 26 Nov 2025 15:55:38 GMT, Aleksey Shipilev <shade at openjdk.org> wrote:

>> See the bug for a discussion of the issues the current machinery has.
>> 
>> This PR executes the plan outlined in the bug:
>>  1. Common the receiver type profiling code in interpreter and C1
>>  2. Rewrite receiver type profiling code to only do atomic receiver slot installations
>>  3. Trim `C1OptimizeVirtualCallProfiling` to only claim slots when receiver is installed 
>> 
>> This PR does _not_ make the counter updates themselves atomic, as that may have much wider performance implications, including regressions. This PR should be at least performance neutral.
>> 
>> Additional testing:
>>   - [x] Linux x86_64 server fastdebug, `compiler/`
>>   - [x] Linux x86_64 server fastdebug, `all`
>
> Aleksey Shipilev has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains 15 commits:
> 
>  - Merge branch 'master' into JDK-8357258-x86-c1-optimize-virt-calls
>  - Tighten up some more
>  - Offset is always rscratch1, no need to save it
>  - Grossly simplify register shuffling
>  - More asserts
>  - More comment touchups
>  - Inline code comments
>  - Mention the updater in ReceiverTypeData
>  - type_profile -> profile_receiver_type
>  - Stylistic: remove redundant assert
>  - ... and 5 more: https://git.openjdk.org/jdk/compare/c028369d...c441209a

Oh, all right! This made me realize we actually have a secondary "fast" case: the receiver is not found, but the profile is full. This is pretty frequent with `TypeProfileWidth=2`. In that case, we are doing way too much work, anticipating a receiver slot installation that would never actually come. Specializing for that case costs significantly fewer loads and gets the code much better pipelined; I suspect that is because tight loops that _do not_ have CASes in them are uop-cached more readily.
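For readers following along, here is a rough Java model of the three paths discussed above: receiver found, receiver not found with a full profile, and receiver not found with a free slot. It is only an illustrative sketch. The actual change is x86 assembly shared by the interpreter and C1, and the class, field, and method names below (`ReceiverProfileSketch`, `overflowCount`, and so on) are hypothetical. The two-pass structure is likewise an assumption about how the specialization could be laid out, not a transcription of the PR.

```java
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical model of a receiver type profile; not HotSpot code.
final class ReceiverProfileSketch {
    // Receiver slots: only ever installed with CAS, never overwritten.
    private final AtomicReferenceArray<Class<?>> receivers;
    // Per-slot counters: plain, non-atomic updates, mirroring the PR's
    // decision not to make the counter updates themselves atomic.
    private final long[] counts;
    // Counts calls whose receiver did not fit into any slot.
    private long overflowCount;

    ReceiverProfileSketch(int width) {
        receivers = new AtomicReferenceArray<>(width);
        counts = new long[width];
    }

    void profile(Class<?> klass) {
        boolean sawEmpty = false;
        // Pass 1: read-only scan, no CAS anywhere in this loop.
        for (int i = 0; i < counts.length; i++) {
            Class<?> k = receivers.get(i);
            if (k == klass) {          // Fast case 1: receiver already has a slot.
                counts[i]++;
                return;
            }
            if (k == null) {
                sawEmpty = true;
            }
        }
        if (!sawEmpty) {               // Fast case 2: not found, profile is full.
            overflowCount++;
            return;
        }
        // Slow case: try to claim a free slot atomically.
        for (int i = 0; i < counts.length; i++) {
            Class<?> witness = receivers.compareAndExchange(i, null, klass);
            // Either we installed the receiver, or another thread
            // installed the same receiver first; both count as a hit.
            if (witness == null || witness == klass) {
                counts[i]++;
                return;
            }
        }
        overflowCount++;               // Lost every race; the profile is full now.
    }

    Class<?> receiver(int slot) { return receivers.get(slot); }
    long count(int slot)        { return counts[slot]; }
    long overflow()             { return overflowCount; }
}
```

With a width of 2 (as with `TypeProfileWidth=2`), every call with a third receiver type takes the second fast case in this model and never reaches the CAS loop.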

We now lose "only" 0.5ns in this test:


Benchmark                                                (randomized)  Mode  Cnt    Score   Error      Units

# Baseline
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   16.945 ±  0.079      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.076 ±  2.187       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3   88.738 ±  0.416       #/op
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.007 ±  0.003       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   49.122 ±  0.353       #/op
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   57.147 ±  1.698       #/op
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  247.443 ±  1.531       #/op

# Old PR version
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   22.513 ±  0.208      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.012 ±  0.072       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3  108.446 ± 13.975       #/op  ; +20 loads
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.407 ±  0.010       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   54.102 ±  0.403       #/op  ; +5 branches
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   75.938 ±  5.043       #/op  ; +19 cycles
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  280.194 ±  5.758       #/op  ; +33 instructions

# New PR version
InterfaceCalls.test2ndInt5Types                                 false  avgt   12   17.441 ±  0.287      ns/op
InterfaceCalls.test2ndInt5Types:L1-dcache-load-misses           false  avgt    3    0.009 ±  0.072       #/op
InterfaceCalls.test2ndInt5Types:L1-dcache-loads                 false  avgt    3   88.803 ±  1.401       #/op
InterfaceCalls.test2ndInt5Types:branch-misses                   false  avgt    3    0.009 ±  0.062       #/op
InterfaceCalls.test2ndInt5Types:branches                        false  avgt    3   52.945 ±  0.752       #/op  ; +4 branches
InterfaceCalls.test2ndInt5Types:cycles                          false  avgt    3   58.866 ± 15.379       #/op  ; +2 cycles
InterfaceCalls.test2ndInt5Types:instructions                    false  avgt    3  272.838 ±  1.665       #/op  ; +25 instructions


The code is in the new commits and passes `hotspot:tier1`; I am running more tests now.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/25305#issuecomment-3596428656

