RFR: 8331658: secondary_super_cache does not scale well: C1 [v2]
Vladimir Ivanov
vlivanov at openjdk.org
Thu Jun 6 06:05:55 UTC 2024
On Wed, 29 May 2024 09:32:41 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This is the C1 version of [JDK-8180450](https://bugs.openjdk.org/browse/JDK-8180450).
>>
>> The new logic in this PR is as simple as I can make it. It is a somewhat-simplified version of the C2 change in [JDK-8180450](https://bugs.openjdk.org/browse/JDK-8180450). In order to reduce risk I haven't touched the existing slow subtype stub.
>> The register allocation logic in the existing code is pretty gnarly, and I have no desire to break anything at this point in the release cycle, so I have allocated just one register more than the existing code does.
>>
>> Performance is pretty good. Before and after:
>>
>> x64, AMD 2950X, 8 cores:
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> SecondarySuperCacheHits.test                 avgt    5   0.959 ± 0.091  ns/op
>> SecondarySuperCacheInterContention.test      avgt    5  42.931 ± 6.951  ns/op
>> SecondarySuperCacheInterContention.test:t1   avgt    5  42.397 ± 7.708  ns/op
>> SecondarySuperCacheInterContention.test:t2   avgt    5  43.466 ± 8.238  ns/op
>> SecondarySuperCacheIntraContention.test      avgt    5  74.660 ± 0.127  ns/op
>>
>> SecondarySuperCacheHits.test                 avgt    5   1.480 ± 0.077  ns/op
>> SecondarySuperCacheInterContention.test      avgt    5   1.461 ± 0.063  ns/op
>> SecondarySuperCacheInterContention.test:t1   avgt    5   1.767 ± 0.078  ns/op
>> SecondarySuperCacheInterContention.test:t2   avgt    5   1.155 ± 0.052  ns/op
>> SecondarySuperCacheIntraContention.test      avgt    5   1.421 ± 0.002  ns/op
>>
>> AArch64, Mac M3, 8 cores:
>>
>>
>> Benchmark                                    Mode  Cnt    Score    Error  Units
>> SecondarySuperCacheHits.test                 avgt    5    0.835 ±  0.021  ns/op
>> SecondarySuperCacheInterContention.test      avgt    5   74.078 ± 18.095  ns/op
>> SecondarySuperCacheInterContention.test:t1   avgt    5   81.863 ± 42.492  ns/op
>> SecondarySuperCacheInterContention.test:t2   avgt    5   66.293 ± 11.254  ns/op
>> SecondarySuperCacheIntraContention.test      avgt    5  335.563 ±  6.171  ns/op
>>
>> SecondarySuperCacheHits.test                 avgt    5    1.212 ±  0.004  ns/op
>> SecondarySuperCacheInterContention.test      avgt    5    0.871 ±  0.002  ns/op
>> SecondarySuperCacheInterContention.test:t1   avgt    5    0.626 ±  0.003  ns/op
>> SecondarySuperCacheInterContention.test:t2   avgt    5    1.115 ±  0.006  ns/op
>> SecondarySuperCacheIntraContention.test      avgt    5    0.696 ±  0.001  ns/op
>>
>>
>>
>> The first test, `SecondarySuperCacheHits`, shows a small regression. It's...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> JDK-8331658: secondary_super_cache does not scale well: C1
Thinking more about the proposal itself (JDK-8331658), I'm curious how relevant the scalability issues with the secondary super cache (SSC) are for the Client (C1-only) VM. I'd expect it to be deployed in constrained environments where contention has a much smaller effect (if any at all). Maybe it's fine to leave SSC as is in the Client VM and focus the performance work on the Tiered VM?
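For context, the contention in question comes from the cache being a single field per class that is rewritten after each successful slow-path scan. The following is a toy Java model, not HotSpot code: `ToyKlass`, `secondarySuperCache`, and `cacheWrites` are names invented for this sketch, and it runs single-threaded purely to count writes. It shows how alternating checks against two different interfaces keep overwriting that one field, which is the access pattern that turns into cache-line ping-pong once several threads do it concurrently.

```java
/** Toy model of a one-element secondary super cache. Not HotSpot code. */
final class ToyKlass {
    final String name;
    final ToyKlass[] secondarySupers;  // stands in for the secondary supers array
    ToyKlass secondarySuperCache;      // single shared slot per class
    int cacheWrites;                   // instrumentation for this demo only

    ToyKlass(String name, ToyKlass... secondarySupers) {
        this.name = name;
        this.secondarySupers = secondarySupers;
    }

    boolean isSubtypeOf(ToyKlass target) {
        if (secondarySuperCache == target) {
            return true;                       // fast path: read only
        }
        for (ToyKlass s : secondarySupers) {   // slow path: linear scan
            if (s == target) {
                secondarySuperCache = target;  // write to state shared by all threads
                cacheWrites++;
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ToyKlass i1 = new ToyKlass("I1");
        ToyKlass i2 = new ToyKlass("I2");

        // Same target every time: one write, then only reads.
        ToyKlass stable = new ToyKlass("Stable", i1, i2);
        for (int n = 0; n < 1000; n++) {
            stable.isSubtypeOf(i1);
        }
        System.out.println("same target:         " + stable.cacheWrites + " write(s)");

        // Alternating targets: every check misses the cache and rewrites it.
        ToyKlass contended = new ToyKlass("Contended", i1, i2);
        for (int n = 0; n < 1000; n++) {
            contended.isSubtypeOf(i1);
            contended.isSubtypeOf(i2);
        }
        System.out.println("alternating targets: " + contended.cacheWrites + " write(s)");
    }
}
```

In this model the first loop performs a single write and the second performs one write per check. Whether that write traffic matters in practice depends on how many cores share the class, which is the question raised above for constrained Client VM deployments.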
> lookup_secondary_supers_table needs to use fixed registers, and quite a lot of them. This patch is a version of the table lookup that uses as few registers as possible, and none of them are fixed.
The main reason why `lookup_secondary_supers_table` uses pre-defined registers is the calling convention between the fast-path checks and the stub on the slow path. If the slow path is inlined, the register set can be chosen arbitrarily. Still, I agree that the table lookup needs more scratch registers to operate.
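To make the scratch-register point concrete, here is a toy Java model of a bitmap-indexed hashed lookup in the general spirit of JDK-8180450. It is a sketch under assumptions, not the HotSpot algorithm: `ToySupersTable`, `home`, the use of `String.hashCode()` as the hash, and the 64-slot layout are all choices made for this example. What it illustrates is that the lookup keeps the target, the bitmap, the current slot, the packed index, and the array base live at the same time, where a plain linear scan needs only the target, a cursor, and a bound.

```java
/** Toy model of a bitmap-indexed secondary supers table. Not HotSpot code. */
final class ToySupersTable {
    private static final int SLOTS = 64;
    private final long bitmap;      // bit s set => slot s of the conceptual table is occupied
    private final String[] packed;  // occupied slots only, in slot order

    ToySupersTable(String... supers) {
        if (supers.length >= SLOTS) {
            throw new IllegalArgumentException("toy model requires at least one free slot");
        }
        String[] slots = new String[SLOTS];
        long bits = 0;
        for (String s : supers) {
            int slot = home(s);
            while (slots[slot] != null) {
                slot = (slot + 1) % SLOTS;  // linear probing on collision
            }
            slots[slot] = s;
            bits |= 1L << slot;
        }
        bitmap = bits;
        packed = new String[supers.length];
        int n = 0;
        for (String s : slots) {
            if (s != null) {
                packed[n++] = s;            // drop the empty slots
            }
        }
    }

    /** Home slot of a name; a stand-in for a per-class hash value. */
    static int home(String name) {
        return Math.floorMod(name.hashCode(), SLOTS);
    }

    boolean contains(String target) {
        int slot = home(target);
        if ((bitmap & (1L << slot)) == 0) {
            return false;                   // empty home slot: definite miss, no memory scan
        }
        // Number of occupied slots below `slot` is the position in the packed array.
        int index = Long.bitCount(bitmap & ((1L << slot) - 1));
        while ((bitmap & (1L << slot)) != 0) {
            if (packed[index].equals(target)) {
                return true;
            }
            slot = (slot + 1) % SLOTS;      // follow the probe sequence
            index = (index + 1) % packed.length;
        }
        return false;                       // hit an empty slot: miss
    }

    public static void main(String[] args) {
        ToySupersTable t = new ToySupersTable(
                "java.io.Serializable", "java.lang.Comparable", "java.lang.Cloneable");
        System.out.println(t.contains("java.lang.Comparable"));
        System.out.println(t.contains("java.util.List"));
    }
}
```

Note that `contains` never writes to the table, so concurrent lookups do not disturb each other, unlike a one-element cache that is updated on a successful scan.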
FTR, `MacroAssembler::check_klass_subtype_slow_path` also has some constraints (at least on x86), but that's because it relies on the `SCAS` instruction. Still, it is used in different contexts with wildly varying sets of available registers (I tried to gather some data on that during my earlier experiments [1]). It relies heavily on spilling to shuffle values or allocate scratch registers when needed. And, speaking of C1, the arguments for the subtype check slow path are also passed on the stack to simplify the implementation. So, performing more spills per se doesn't look like a show-stopper (when it happens outside C2-generated code).
[1] https://github.com/iwanowww/jdk/blob/ssc.cuckoo.2seed/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L4441
-------------
PR Comment: https://git.openjdk.org/jdk/pull/19426#issuecomment-2151474456