RFR: 8308869: C2: use profile data in subtype checks when profile has more than one class

Thu Jun 8 11:34:07 UTC 2023

In this simple micro benchmark:

https://github.com/franz1981/java-puzzles/blob/main/src/main/java/red/hat/puzzles/polymorphism/RequireNonNullCheckcastScalability.java#L70

Performance drops sharply with a polluted profile, from:

Benchmark                                         (typePollution)   Mode  Cnt     Score    Error   Units
RequireNonNullCheckcastScalability.isDuplicated1            false  thrpt   10  1453.372 ± 24.919  ops/us

to:

Benchmark                                         (typePollution)   Mode  Cnt   Score   Error   Units
RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  28.579 ± 2.280  ops/us

The test performs 2 type checks against 2 different interfaces, so
caching with the single-entry `secondary_super_cache` doesn't help.
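
To illustrate why a one-element cache doesn't help here, below is a
simplified Java model, not HotSpot code. `KlassModel`, `InterfaceA` and
`InterfaceB` are hypothetical names; the model only assumes that the
cache remembers the last interface that matched.

```java
import java.util.List;

// Simplified model of a per-class one-element cache; not HotSpot code.
final class KlassModel {
    private final List<String> secondarySupers; // interfaces the class implements
    private String secondarySuperCache;         // last interface that matched
    int slowScans;                              // number of linear scans taken

    KlassModel(List<String> secondarySupers) {
        this.secondarySupers = secondarySupers;
    }

    boolean isSubtypeOf(String iface) {
        if (iface.equals(secondarySuperCache)) {
            return true;                        // fast path: cache hit
        }
        slowScans++;                            // slow path: scan the list
        if (secondarySupers.contains(iface)) {
            secondarySuperCache = iface;        // overwrite the single slot
            return true;
        }
        return false;
    }
}

final class CacheThrashDemo {
    // Alternate checks against two interfaces: every check misses the cache.
    static int scansForAlternatingChecks(int iterations) {
        KlassModel k = new KlassModel(List.of("InterfaceA", "InterfaceB"));
        for (int i = 0; i < iterations; i++) {
            k.isSubtypeOf("InterfaceA");
            k.isSubtypeOf("InterfaceB");
        }
        return k.slowScans;
    }

    // Repeated checks against one interface: only the first check misses.
    static int scansForSingleInterface(int iterations) {
        KlassModel k = new KlassModel(List.of("InterfaceA", "InterfaceB"));
        for (int i = 0; i < iterations; i++) {
            k.isSubtypeOf("InterfaceA");
        }
        return k.slowScans;
    }
}
```

In this model, alternating checks take the slow path on every call,
while checks against a single interface take it only once.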

The micro-benchmark only uses 2 different concrete classes
(`DuplicatedContext` and `NonDuplicatedContext`) and they are recorded
in profile data at the type checks. But C2 only takes advantage of
profile data at type checks if they report a single class.

What I propose is that the full-blown type check expanded in
`Phase::gen_subtype_check()` takes advantage of profile data. So in
the case of the micro benchmark, before checking the
`secondary_super_cache`, generated code checks whether the object
being type checked is a `DuplicatedContext` or a
`NonDuplicatedContext`.
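
The idea can be sketched in Java as follows. This is only a model of
the check order, not the code C2 emits: `ProfiledSubtypeCheck` is a
hypothetical name, the answer for each profiled class is assumed to be
known up front, `Class.isInstance()` stands in for the full check, and
null handling is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Model of a type check that consults profiled classes first.
final class ProfiledSubtypeCheck {
    private final Class<?> superType;
    // For each class seen in the profile, the answer is computed once.
    private final Map<Class<?>, Boolean> profiled = new LinkedHashMap<>();
    int fallbacks; // how often the full check was needed

    ProfiledSubtypeCheck(Class<?> superType, Class<?>... profiledClasses) {
        this.superType = superType;
        for (Class<?> c : profiledClasses) {
            profiled.put(c, superType.isAssignableFrom(c));
        }
    }

    boolean check(Object o) {
        Class<?> k = o.getClass();
        for (Map.Entry<Class<?>, Boolean> e : profiled.entrySet()) {
            if (e.getKey() == k) {
                return e.getValue();     // exact class match: answer is known
            }
        }
        fallbacks++;                     // class not in the profile
        return superType.isInstance(o);  // stands in for the full check
    }
}
```

For example, a check built with `CharSequence.class` as the super type
and `String.class` and `StringBuilder.class` as profiled classes never
reaches the fallback for strings or string builders.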

This works fairly well on this micro benchmark:

Benchmark                                         (typePollution)   Mode  Cnt    Score    Error   Units
RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  871.224 ± 20.750  ops/us

It also scales much better if there are multiple threads running the
same test (`secondary_super_cache` doesn't scale well: see
JDK-8180450).

Now if the micro-benchmark is changed according to the comment:

https://github.com/franz1981/java-puzzles/blob/d2d60af3d0dfe7a2567807395138edcb1d1c24f5/src/main/java/red/hat/puzzles/polymorphism/RequireNonNullCheckcastScalability.java#L62

so the type check hits in the `secondary_super_cache`, the current
code performs much better:

Benchmark                                         (typePollution)   Mode  Cnt    Score    Error   Units
RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  871.224 ± 20.750  ops/us

but leveraging profiling as explained above performs even better:

Benchmark                                         (typePollution)   Mode  Cnt     Score    Error   Units
RequireNonNullCheckcastScalability.isDuplicated1             true  thrpt   10  1165.474 ± 70.171  ops/us

I think it's actually likely that there's a performance advantage even
if profiling sees more than 2 types at a type check, unless the profile
is heavily polluted. The problem is that, given the way profile data is
currently collected, we can't tell whether the profile is heavily
polluted because, unlike profiling at virtual calls, there's no counter
for non-recorded types. The `count` field is used to count failed type
checks instead. JVMCI added a `nonprofiled_count`. I thought about
using that one but, after looking at the way C2 uses the failed type
check count, it seems simpler to collect profile data at type checks
the way it's done at virtual calls. Indeed, C2 uses the unique class
reported by profile data only if no failed type checks were recorded
in profile data but:

- at checkcasts, it also checks that it can prove the check would
  statically fold. That last check seems to be the one that matters.

- at instanceof, AFAICT, a profiled type that causes the instanceof to
  fail is as valuable as one that makes it succeed so it would be
  better to ignore failures reported by profiling.

I also discussed this briefly with Tom and he said Graal doesn't need
the failed type check count.

So, in the patch I propose, I changed the way profile data is
collected so it works the same way it does at virtual calls. If this
patch is accepted, I'll need help with platforms other than x86 and
aarch64.
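
As a rough Java model of virtual-call-style profiling: a fixed number
of rows record a class and a hit count each, and a separate counter
tracks types that did not fit in any row. `TypeProfileModel` and its
methods are hypothetical names and the layout is simplified; the point
is only that the extra counter makes it possible to estimate how
polluted a profile is.

```java
// Simplified model of a receiver-type profile with an overflow counter.
final class TypeProfileModel {
    private final Class<?>[] rows;   // recorded classes
    private final long[] rowCounts;  // hits per recorded class
    private long nonRecorded;        // hits for classes that found no free row

    TypeProfileModel(int width) {
        rows = new Class<?>[width];
        rowCounts = new long[width];
    }

    void record(Class<?> k) {
        for (int i = 0; i < rows.length; i++) {
            if (rows[i] == k) {      // class already has a row
                rowCounts[i]++;
                return;
            }
        }
        for (int i = 0; i < rows.length; i++) {
            if (rows[i] == null) {   // claim the first free row
                rows[i] = k;
                rowCounts[i] = 1;
                return;
            }
        }
        nonRecorded++;               // all rows taken by other classes
    }

    long nonRecordedCount() {
        return nonRecorded;
    }

    // Fraction of all observations that could not be attributed to a row.
    double pollution() {
        long total = nonRecorded;
        for (long c : rowCounts) {
            total += c;
        }
        return total == 0 ? 0.0 : (double) nonRecorded / total;
    }
}
```

With 2 rows, a third distinct class only bumps the overflow counter, so
a compiler reading this profile could decide whether the recorded
classes cover enough of the observations to be worth testing for.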

I also modified the JVMCI code. BTW, I also wonder whether
`VirtualCallData.getMethodProfile()` is obsolete.

Finally, I changed `Phase::gen_subtype_check()` so it emits the extra
checks. That method is now called at macro expansion, when profile data
is no longer available. So I attached profile data to the
`SubTypeCheck` node. For each profile data entry, 2 edges are added:
one for the klass, one for the profile frequency.

Because `SubTypeCheck` now has extra edges, it can happen that 2
`SubTypeCheck` nodes that perform the same subtype check are not
commoned during `IGVN`, which can get in the way of some optimizations.
I had to make some adjustments to the logic of split if and to the code
that looks for dominating identical checks because of that.
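
The commoning issue can be illustrated with a toy value-numbering table
in Java. This is not C2's implementation: `ValueNumberingModel` is a
hypothetical name and nodes are reduced to an opcode plus a list of
input names. It only assumes that two nodes are merged when their
opcode and all of their inputs are equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy value numbering: nodes with equal opcode and inputs share one id.
final class ValueNumberingModel {
    private final Map<List<String>, Integer> table = new HashMap<>();
    private int nextId;

    // Returns the id of an existing identical node, or allocates a new id.
    int intern(String opcode, String... inputs) {
        List<String> key = new ArrayList<>();
        key.add(opcode);
        key.addAll(Arrays.asList(inputs));
        Integer id = table.get(key);
        if (id == null) {
            id = nextId++;
            table.put(key, id);
        }
        return id;
    }
}
```

In this model, two `SubTypeCheck` entries with inputs `obj` and `K` get
the same id, but adding different profile inputs to each (say a
different klass and frequency) gives them different ids even though
they test the same thing.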

-------------

Commit messages:
 - white spaces
 - fix & test

Changes: https://git.openjdk.org/jdk/pull/14375/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=14375&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8308869
  Stats: 849 lines in 24 files changed: 523 ins; 232 del; 94 mod
  Patch: https://git.openjdk.org/jdk/pull/14375.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/14375/head:pull/14375

PR: https://git.openjdk.org/jdk/pull/14375