RFR: 8303737: C2: Load can bypass subtype check that enforces it's from the right object type

Wed Sep 6 15:04:20 UTC 2023

This patch has 3 test cases:

TestAddPChainMismatchedBase.java
TestAddPChainMismatchedBase2.java

Both fail with a "Base pointers must match" assert

TestLoadBypassesClassCast.java

fails with: assert(Universe::is_in_heap(result)) failed: object not in heap

All failures are related to 8139771 (Eliminating CastPP nodes at Phis
when they all come from a unique input may cause crash). The problem
there was that if a Phi merges say:

(Phi (CastPP in) in)

it was transformed to `in`. Doing that would cause the control
dependency from the `Phi` on the null check that guards the `CastPP`
to be dropped. As demonstrated by the test case of 8139771, a load
could then float above the null check that should have guaranteed the
object being read from was non null.

The fix for 8139771 was to transform:

(Phi (CastPP in) in)

to:

(CastPP in)

That `CastPP` inherits the type of the `Phi` and is flagged as a
special kind of cast node so it's not eliminated if the `CastPP`
doesn't narrow the type of its input. As a result some depending load
cannot float above the cast.

Because that was found to be too conservative, code was added to try
to eliminate cast node added as replacement of a `Phi` by looking for
a dominating cast with the same input and a narrower type that the
cast. The idea was that if there was some dominating null check then
we would not need the cast. As TestLoadBypassesClassCast.java shows,
this is flawed.

TestLoadBypassesClassCast.java demonstrates that what used to happen
with a null check (a load that bypasses a null check) can still happen
with a type check: in the case of the test, the load of `objectField`
from `o` happens before `o` is checked to be of type `A` (the holder
of the field). When `o` is not of type `A`, the result of the load is
never used by compiled code but if the gc runs while the result of the
load is live (as it does in the test case), gc code sees something
that's not a valid oop and crashes.

In the test case, the same logic is duplicated in 2 branches of
ifs. There are 2 loads of `objectField`, one in each copy of the
logic. After transformation, the `objectField` loads lose their
dependency on type checks for `A` and common above a safepoint causing
a load from something that may not be a `A` to be live across a
safepoint.

Now looking at one copy of the logic:

- right above the `objectField` load, a `Phi` (that merges the 2
  banches of `if (flag) {` and has type non null Object) is tested to
  be a subtype of `A`.

- the subtype checks is split throught `Phi` but the `CheckCastPP`
  stays at the merge point.

- The `Phi` is replaced by a `CastPP` to non null `Object`.

- The `CastPP` is replaced by a dominating `CastPP` to non null that's
  at the beginning of the method.

- The `if (flag)` goes away and the `CheckCastPP` becomes control
  dependent on the range check for the array load that's above. That
  range check is replaced by a dominating range check, itself at the
  beginning of the method. Now the `CheckCastPP` is above any subtype
  check. The load can float as high as the `CheckCastPP`.

The flaw here I think is the logic from 8139771 that assumes a cast
node can be replaced by any cast with a narrower or equal type. In
that case, the cast replaces a `Phi` that merges types not null `A`
and not null `Object` which as a result has type non null `Object`. It
would be fine to replace the cast with a dominating cast that's
narrower that the type of each input to the Phi (so in this case non
null `A` and non null `Object`). That's not the case for the cast of
the null check. The problem is when the `Phi` is transformed to a
cast, the types of its inputs are lost. What I propose in the fix is
to attach those types to the cast and change the logic that decides
whether a cast is redundant (whether because of the type of its input
or because of a dominating cast) so it goes over the list of attached
types.

How does that relate to the other 2 test cases that fail because in a
chain of `Addp` nodes, not all nodes have the same base (some casted,
some not)? What I found looking at those test cases is that the graph
contained some casts created as a result of the `PhiNode::Ideal` that
were "obviously" useless but that are conservatively kept. Those casts
were involved in a chain of transformations that cause the assert. The
new logic does a better job of removing the casts and make the
failures go away. So the change doesn't directly fix it but, at the
very least, it makes it less likely and I'm not sure there's more work
needed at this point unless we see that failure again.

-------------

Commit messages:
 - comments
 - fix & test

Changes: https://git.openjdk.org/jdk/pull/15595/files
 Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=15595&range=00
  Issue: https://bugs.openjdk.org/browse/JDK-8303737
  Stats: 495 lines in 8 files changed: 446 ins; 3 del; 46 mod
  Patch: https://git.openjdk.org/jdk/pull/15595.diff
  Fetch: git fetch https://git.openjdk.org/jdk.git pull/15595/head:pull/15595

PR: https://git.openjdk.org/jdk/pull/15595