RFR: 8331658: secondary_super_cache does not scale well: C1 [v2]
Andrew Haley
aph at openjdk.org
Thu Jun 6 09:11:57 UTC 2024
On Wed, 29 May 2024 09:32:41 GMT, Andrew Haley <aph at openjdk.org> wrote:
>> This is the C1 version of [JDK-8180450](https://bugs.openjdk.org/browse/JDK-8180450).
>>
>> The new logic in this PR is as simple as I can make it. It is a somewhat-simplified version of the C2 change in [JDK-8180450](https://bugs.openjdk.org/browse/JDK-8180450). In order to reduce risk I haven't touched the existing slow subtype stub.
>> The register allocation logic in the existing code is pretty gnarly, and I have no desire to break anything at this point in the release cycle, so I have allocated just one register more than the existing code does.
>>
>> Performance is pretty good. Before and after:
>>
>> x64, AMD 2950X, 8 cores:
>>
>>
>> Before:
>> Benchmark                                     Mode  Cnt   Score   Error  Units
>> SecondarySuperCacheHits.test                  avgt    5   0.959 ± 0.091  ns/op
>> SecondarySuperCacheInterContention.test       avgt    5  42.931 ± 6.951  ns/op
>> SecondarySuperCacheInterContention.test:t1    avgt    5  42.397 ± 7.708  ns/op
>> SecondarySuperCacheInterContention.test:t2    avgt    5  43.466 ± 8.238  ns/op
>> SecondarySuperCacheIntraContention.test       avgt    5  74.660 ± 0.127  ns/op
>>
>> After:
>> SecondarySuperCacheHits.test                  avgt    5   1.480 ± 0.077  ns/op
>> SecondarySuperCacheInterContention.test       avgt    5   1.461 ± 0.063  ns/op
>> SecondarySuperCacheInterContention.test:t1    avgt    5   1.767 ± 0.078  ns/op
>> SecondarySuperCacheInterContention.test:t2    avgt    5   1.155 ± 0.052  ns/op
>> SecondarySuperCacheIntraContention.test       avgt    5   1.421 ± 0.002  ns/op
>>
>> AArch64, Mac M3, 8 cores:
>>
>>
>> Before:
>> Benchmark                                     Mode  Cnt    Score    Error  Units
>> SecondarySuperCacheHits.test                  avgt    5    0.835 ±  0.021  ns/op
>> SecondarySuperCacheInterContention.test       avgt    5   74.078 ± 18.095  ns/op
>> SecondarySuperCacheInterContention.test:t1    avgt    5   81.863 ± 42.492  ns/op
>> SecondarySuperCacheInterContention.test:t2    avgt    5   66.293 ± 11.254  ns/op
>> SecondarySuperCacheIntraContention.test       avgt    5  335.563 ±  6.171  ns/op
>>
>> After:
>> SecondarySuperCacheHits.test                  avgt    5    1.212 ±  0.004  ns/op
>> SecondarySuperCacheInterContention.test       avgt    5    0.871 ±  0.002  ns/op
>> SecondarySuperCacheInterContention.test:t1    avgt    5    0.626 ±  0.003  ns/op
>> SecondarySuperCacheInterContention.test:t2    avgt    5    1.115 ±  0.006  ns/op
>> SecondarySuperCacheIntraContention.test       avgt    5    0.696 ±  0.001  ns/op
>>
>> The first test, `SecondarySuperCacheHits`, shows a small regression. It's...
>
> Andrew Haley has updated the pull request incrementally with one additional commit since the last revision:
>
> JDK-8331658: secondary_super_cache does not scale well: C1
I agree with most of what you write. Some responses to points made:
> Thinking more about the proposal itself (JDK-8331658) I'm curious how relevant scalability issues with SSC are for Client (C1-only) VM. I'd expect it to be deployed in constrained environments where contention has much smaller effects (if present at all). Maybe it's fine to leave SSC as is in Client VM and focus performance work on Tiered VM?
Maybe, but it's hard to speculate well about a bunch of code we'll never see.
The end goal, I hope, is to remove the `secondary_super_cache` field altogether because it does nothing useful. Apart from anything else, we'd remove some cruft in the VM. So C1, the interpreter, and the runtime are next.
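To make the problem concrete, here is a rough Java model (not HotSpot code; the class and field names are invented for illustration) of what the existing mechanism does on a secondary-supertype check, and of why the write on every cache miss is what makes the cache line ping-pong when threads keep testing different supertypes:

    // Simplified model of a Klass with the existing secondary_super_cache.
    final class KlassModel {
        final KlassModel[] secondarySupers;   // flat list of secondary supertypes
        KlassModel secondarySuperCache;       // last secondary supertype that matched

        KlassModel(KlassModel[] secondarySupers) {
            this.secondarySupers = secondarySupers;
        }

        boolean isSecondarySubtypeOf(KlassModel target) {
            if (secondarySuperCache == target) {
                return true;                          // cache hit: a single compare
            }
            for (KlassModel s : secondarySupers) {    // cache miss: linear scan
                if (s == target) {
                    secondarySuperCache = target;     // the write that invalidates the
                    return true;                      // cache line on every other core
                }
            }
            return false;
        }
    }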
> > lookup_secondary_supers_table needs to use fixed registers, and quite a lot of them. This patch is a version of the table lookup that uses as few registers as possible, and none of them are fixed.
>
> The main reason why `lookup_secondary_supers_table` uses pre-defined registers is the calling convention between the fast-path checks and the stub on the slow path. If the slow path is inlined, the register set can be chosen arbitrarily.
Certainly.
> Still, I agree that table lookup needs more scratch registers to operate.
>
> FTR `MacroAssembler::check_klass_subtype_slow_path` also has some constraints (at least on x86), but that's because it relies on the `SCAS` instruction. Still, `MacroAssembler::check_klass_subtype_slow_path` is used in different contexts with wildly varying sets of available registers (I tried to gather some data on that during my earlier experiments [1]). It heavily relies on spilling to shuffle values or allocate scratch registers when needed.
Indeed. And when the actual work done by the table lookup takes on the order of a nanosecond, it seems rather disproportionate to surround it with register shuffles. But yes, that works. I'm doing some experiments which seem to show that even with some spilling, the hash-table lookup is still a win in such cases. That's a bit surprising to me, but it's what the measurements seem to say.
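To show what the table lookup replaces the cache with, here is a similarly rough Java sketch (again not HotSpot code; the 6-bit hash slot, the field names, and the collision fallback are simplified stand-ins for what JDK-8180450 actually does). The point is that the lookup performs no writes and, in the common case, costs a couple of loads plus a popcount:

    // Simplified model of a Klass with a hashed secondary-supers table.
    // Assumed invariants: every secondary super has a precomputed hash slot in
    // 0..63, its bit is set in the bitmap, and the table holds all secondary
    // supers ordered by hash slot.
    final class HashedKlassModel {
        final long bitmap;                  // bit h set => some super hashes to slot h
        final HashedKlassModel[] table;     // all secondary supers, ordered by hash slot
        final int hashSlot;                 // this class's own hash slot (0..63)

        HashedKlassModel(long bitmap, HashedKlassModel[] table, int hashSlot) {
            this.bitmap = bitmap;
            this.table = table;
            this.hashSlot = hashSlot;
        }

        boolean isSecondarySubtypeOf(HashedKlassModel target) {
            int h = target.hashSlot;
            if ((bitmap & (1L << h)) == 0) {
                return false;               // definite miss: one load, one test, no write
            }
            // Index of the slot for hash h: count the bits set below h.
            int index = Long.bitCount(bitmap & ((1L << h) - 1));
            if (table[index] == target) {
                return true;                // common hit: two loads and a popcount
            }
            // Hash collision: fall back to scanning the (still read-only) table.
            for (HashedKlassModel s : table) {
                if (s == target) {
                    return true;
                }
            }
            return false;
        }
    }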
> And, speaking of C1, the arguments for subtype check slow path are also passed on stack to simplify implementation. So, performing more spills per se doesn't look like a show-stopper (when it happens outside C2-generated code).
Sure, but I'm trying to make the common cases perform as well as I can. There's a balance to be struck between the complexity of the VM and the efficiency of the runtime code, but I think the best way to reduce complexity is to get rid of all uses of the `secondary_super_cache` field.
I'll come back when I have more.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/19426#issuecomment-2151789488