RFR: 8331658: secondary_super_cache does not scale well: C1 [v2]

Thu Jun 6 06:05:55 UTC 2024

On Wed, 29 May 2024 09:32:41 GMT, Andrew Haley <aph at openjdk.org> wrote:

>> This is the C1 version of [JDK-8180450](https://bugs.openjdk.org/browse/JDK-8180450).
>> 
>> The new logic in this PR is as simple as I can make it. It is a somewhat-simplified version of the C2 change in [JDK-8180450](https://bugs.openjdk.org/browse/JDK-8180450). In order to reduce risk I haven't touched the existing slow subtype stub.
>> The register allocation logic in the existing code is pretty gnarly, and I have no desire to break anything at this point in the release cycle, so I have allocated just one register more than the existing code does.
>> 
>> Performance is pretty good. Before and after:
>> 
>> x64, AMD 2950X, 8 cores:
>> 
>> 
>> Benchmark                                   Mode  Cnt   Score   Error  Units
>> SecondarySuperCacheHits.test                avgt    5   0.959 ± 0.091  ns/op
>> SecondarySuperCacheInterContention.test     avgt    5  42.931 ± 6.951  ns/op
>> SecondarySuperCacheInterContention.test:t1  avgt    5  42.397 ± 7.708  ns/op
>> SecondarySuperCacheInterContention.test:t2  avgt    5  43.466 ± 8.238  ns/op
>> SecondarySuperCacheIntraContention.test     avgt    5  74.660 ± 0.127  ns/op
>> 
>> SecondarySuperCacheHits.test                avgt    5  1.480 ± 0.077  ns/op
>> SecondarySuperCacheInterContention.test     avgt    5  1.461 ± 0.063  ns/op
>> SecondarySuperCacheInterContention.test:t1  avgt    5  1.767 ± 0.078  ns/op
>> SecondarySuperCacheInterContention.test:t2  avgt    5  1.155 ± 0.052  ns/op
>> SecondarySuperCacheIntraContention.test     avgt    5  1.421 ± 0.002  ns/op
>> 
>> AArch64, Mac M3, 8 cores:
>> 
>> 
>> Benchmark                     Mode  Cnt  Score   Error  Units
>> SecondarySuperCacheHits.test                avgt    5    0.835 ±  0.021  ns/op
>> SecondarySuperCacheInterContention.test     avgt    5   74.078 ± 18.095  ns/op
>> SecondarySuperCacheInterContention.test:t1  avgt    5   81.863 ± 42.492  ns/op
>> SecondarySuperCacheInterContention.test:t2  avgt    5   66.293 ± 11.254  ns/op
>> SecondarySuperCacheIntraContention.test     avgt    5  335.563 ±  6.171  ns/op
>> 
>> SecondarySuperCacheHits.test                avgt    5  1.212 ±  0.004  ns/op
>> SecondarySuperCacheInterContention.test     avgt    5  0.871 ±  0.002  ns/op
>> SecondarySuperCacheInterContention.test:t1  avgt    5  0.626 ±  0.003  ns/op
>> SecondarySuperCacheInterContention.test:t2  avgt    5  1.115 ±  0.006  ns/op
>> SecondarySuperCacheIntraContention.test     avgt    5  0.696 ±  0.001  ns/op
>> 
>> 
>> 
>> The first test, `SecondarySuperCacheHits`, showns a small regression. It's...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
> 
>   JDK-8331658: secondary_super_cache does not scale well: C1

Thinking more about the proposal itself (JDK-8331658) I'm curious how relevant scalability issues with SSC are for Client (C1-only) VM. I'd expect it to be deployed in constrained environments where contention has much smaller effects (if present at all). Maybe it's fine to leave SSC as is in Client VM and focus performance work on Tiered VM?

> lookup_secondary_supers_table needs to use fixed registers, and quite a lot of them. This patch is a version of the table lookup that uses as few registers as possible, and none of them are fixed.

The main reason why `lookup_secondary_supers_table` uses pre-defined registers is calling conventions between fast path checks and the stub on slow path. If slow path is inlined, the register set can be chosen arbitrarily. Still, I agree that table lookup needs more scratch registers to operate.

FTR `MacroAssembler::check_klass_subtype_slow_path` also has some constraints (at least, on x86), but that's because it relies on `SCAS` instruction. Still, `MacroAssembler::check_klass_subtype_slow_path` is used in different contexts with wildly varying set of available registers (I tried to gather some data on that during my earlier experiments [1]). It heavily relies on spilling to shuffle values or allocate scratch registers when needed. And, speaking of C1, the arguments for subtype check slow path are also passed on stack to simplify implementation. So, performing more spills per se doesn't look like a show-stopper (when it happens outside C2-generated code).  

[1] https://github.com/iwanowww/jdk/blob/ssc.cuckoo.2seed/src/hotspot/cpu/x86/macroAssembler_x86.cpp#L4441

-------------

PR Comment: https://git.openjdk.org/jdk/pull/19426#issuecomment-2151474456