RFR: 8308869: C2: use profile data in subtype checks when profile has more than one class [v6]

Fri Jun 23 18:51:08 UTC 2023

On Mon, 19 Jun 2023 12:22:56 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> In this simple micro benchmark:
>> 
>> https://github.com/franz1981/java-puzzles/blob/main/src/main/java/red/hat/puzzles/polymorphism/RequireNonNullCheckcastScalability.java#L70
>> 
>> Performance drops sharply with polluted profile:
>> 
>> 
>> Benchmark                                         (typePollution)   Mode  Cnt     Score    Error   Units
>> RequireNonNullCheckcastScalability.isDuplicated1            false  thrpt   10  1453.372 ± 24.919  ops/us
>> 
>> 
>> to:
>> 
>> 
>> Benchmark                                         (typePollution)   Mode  Cnt   Score   Error   Units
>> RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  28.579 ± 2.280  ops/us
>> 
>> 
>> The test has 2 type checks to 2 different interfaces so caching with
>> `secondary_super_cache` doesn't help.
>> 
>> The micro-benchmark only uses 2 different concrete classes
>> (`DuplicatedContext` and `NonDuplicatedContext`) and they are recorded
>> in profile data at the type checks. But c2 only take advantage of
>> profile data at type checks if they report a single class.
>> 
>> What I propose is that the full blown type check expanded in
>> `Phase::gen_subtype_check()` takes advantage of profile data. So in
>> the case of the micro benchmark, before checking the
>> `secondary_super_cache`, generated code checks whether the object
>> being type checked is a `DuplicatedContext` or a
>> `NonDuplicatedContext`.
>> 
>> This works fairly well on this micro benchmark:
>> 
>> 
>> Benchmark                                         (typePollution)   Mode  Cnt    Score    Error   Units
>> RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  871.224 ± 20.750  ops/us
>> 
>> 
>> It also scales much better if there are multiple threads running the
>> same test (`secondary_super_cache` doesn't scale well: see
>> JDK-8180450).
>> 
>> Now if the micro-benchmark is changed according to the comment:
>> 
>> https://github.com/franz1981/java-puzzles/blob/d2d60af3d0dfe7a2567807395138edcb1d1c24f5/src/main/java/red/hat/puzzles/polymorphism/RequireNonNullCheckcastScalability.java#L62
>> 
>> so the type check hits in the `secondary_super_cache`, the current
>> code performs much better:
>> 
>> 
>> Benchmark                                         (typePollution)   Mode  Cnt    Score    Error   Units
>> RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  871.224 ± 20.750  ops/us
>> 
>> 
>> but leveraging profiling as explained above performs even better:
>> 
>> 
>> Benchmark                   ...
>
> Roland Westrelin has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains eight additional commits since the last revision:
> 
>  - more test failures
>  - Merge branch 'master' into JDK-8308869
>  - whitespaces
>  - test failures
>  - review
>  - 32 bit fix
>  - white spaces
>  - fix & test

Overall, I'd prefer to leave commoning considerations for a separate enhancement. 

Embedding `ciCallProfile` looks to me much cleaner than exploding its content into node inputs.

Having profile info explicitly fed into `SubTypeCheck` as node inputs in practice defeats any possible sharing unless the nodes are constructed from the very same profile data. The types, their order, and frequencies have to perfectly match in order for commoning to happen. You already have  `IfNode::same_condition()` to alleviate some of the effects of broken sharing.

When you embed profiling info you are left with a choice how to common nodes (whether to take profiling info into account or not). But if you simply ignore it until macro expansion, the behavior will stay the same as it is now.

I prefer the patch to be focused on slow path case (reduce the frequency of secondary super cache checks & updates) and leave the rest for future considerations.  

As an example, it's still an open question for me should `IfNode::search_identical()` take profile info into account. Current patch ignores profile-related info (`IfNode::same_condition()` check), but maybe it is worth merging the profiles instead?  

> Some SubTypeCheck nodes have no profile data associated with them.

I don't consider footprint as an issue here. `SubTypeCheck`s are relatively rare and `ciCallProfile` size is quite small for any practical morphism limits. Additional profiling may introduce more about 1-2 additional slots (rather than 10s or 100s) and the main footprint hit will be on runtime side (in MDOs). 

> Is that out of concern that getting the code done on all platforms will be too complicated?

It does look like an excessive requirement, but I'm not too much concerned about it. If you think it's better to get the full support all at once, I'm perfectly fine with that. It just seems cleaner to refine profiling part separately. There are open questions which may be well out of scope for the proposed enhancement. 

For example, while `checkcast`/`aastore` behave very similarly to `invokevirtual`/`invokeinterface` (very low rate of failures), `instanceof` is different and can expose very high rates of failures (esp. in case of chained `instanceof` checks). Should we continue profiling for that? (Can C2 benefit from such info? I believe so: we could skip SSC check if failure rate is too high.) 

Also, I refrained from commenting on naming, but `ciCallProfile` does look confusing when it comes to `checkcast`,  `aastore`, and `instanceof` cases.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14375#issuecomment-1604712395