Update on PEA in C2 (Episode 5)

Mon Aug 14 19:21:07 UTC 2023

Hi,

I would like to give an update on what we have done in the past month.

Previously, we mentioned that we planned to materialize objects using
DFS. We have completed this, and it fixed 7 jtreg failures caused by
object composition. We expected it to fix the performance issue of
CounterUppercase.java too. Unfortunately, that didn't pan out; we will
cover it later.
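To make "materialize an object using DFS" concrete, here is a minimal sketch in a toy model, not C2 code: VirtualObject, materialize and the "allocate"/"store" strings are hypothetical stand-ins for PEA's virtual state and the ideal nodes it emits. When an object is materialized, every virtual object reachable through its fields must be materialized too. Allocating each object on first visit, before descending into its fields, lets the walk terminate on reference cycles.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DfsMaterialize {
    // Hypothetical model of a virtual (not yet allocated) object tracked by PEA.
    static final class VirtualObject {
        final String name;
        final Map<String, VirtualObject> fields = new LinkedHashMap<>();
        VirtualObject(String name) { this.name = name; }
    }

    // Returns the operations needed to materialize 'root' and every virtual
    // object reachable from it, in DFS order.
    static List<String> materialize(VirtualObject root) {
        List<String> ops = new ArrayList<>();
        visit(root, new HashSet<>(), ops);
        return ops;
    }

    private static void visit(VirtualObject obj, Set<VirtualObject> seen, List<String> ops) {
        if (!seen.add(obj)) {
            return; // already allocated; this also breaks reference cycles
        }
        ops.add("allocate " + obj.name); // allocate before descending so a cycle can refer back to it
        for (Map.Entry<String, VirtualObject> e : obj.fields.entrySet()) {
            visit(e.getValue(), seen, ops); // materialize the referenced object first
            ops.add("store " + obj.name + "." + e.getKey() + " = " + e.getValue().name);
        }
    }
}
```

For two objects that point at each other (a.f = b, b.g = a), materialize(a) yields "allocate a", "allocate b", "store b.g = a", "store a.f = b".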

We set up a custom workflow using GitHub Actions (GHA) to transparently
track the status of the PEA_beta branch.

The workflow consists of 3 major stages.
1. Build a fastdebug binary on linux-x64
2. Run smoke tests and CTW (compile-the-world) on the 'java.base' module
3. Run tier1 tests, broken down into 3 concurrent tasks: hotspot-tier1,
   jdk-tier1 and langtools-tier1.

Since our last report, we have fixed all regressions in
hotspot:tier1. Besides the object composition issue, we also discovered
the following issues:
1. We may have to materialize an object whose monitor count is
unbalanced, e.g. https://github.com/navyxliu/jdk/blob/PEA_beta/PEA/MatInMonitor.java#L6
For now, we work around this case by marking the object Escaped at the
MonitorEnter bytecode.

2. Some intrinsics have memory side effects, for instance
Unsafe.compareAndSetReference(). We currently materialize all object
references passed to any intrinsic, as if the intrinsic were a
non-inlined call. This is too conservative: Object::hashCode() has no
side effect on 'this'. We will loosen this constraint.

3. PredictedCallGenerator introduces an if-else construct based on
speculation. PEA materialization may take place in either branch, so we
need to merge the allocation states where the branches join.
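The three rules above can be sketched together in a toy model. This is a simplification under stated assumptions, not the PEA_beta code: AllocState, monitorEnter, intrinsicCall and merge are hypothetical names, and an object's state is reduced to VIRTUAL or ESCAPED. The sideEffectFree set models the planned loosening (e.g. hashCode); today it would be empty.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PeaStateSketch {
    enum State { VIRTUAL, ESCAPED }

    // Hypothetical per-block allocation state: object name -> state.
    static final class AllocState {
        final Map<String, State> objects = new HashMap<>();

        AllocState copy() {
            AllocState c = new AllocState();
            c.objects.putAll(objects);
            return c;
        }

        // Rule 1 (current workaround): any tracked object reaching MonitorEnter escapes.
        void monitorEnter(String obj) {
            objects.computeIfPresent(obj, (k, v) -> State.ESCAPED);
        }

        // Rule 2: treat an intrinsic like an opaque call and materialize every
        // object argument, unless the intrinsic is known not to publish it
        // (the allowlist is hypothetical and models the planned loosening).
        void intrinsicCall(String intrinsic, List<String> args, Set<String> sideEffectFree) {
            if (sideEffectFree.contains(intrinsic)) {
                return;
            }
            for (String a : args) {
                objects.computeIfPresent(a, (k, v) -> State.ESCAPED);
            }
        }

        // Rule 3: at the join of an if-else, an object stays virtual only if it is
        // virtual on both incoming paths; otherwise the other path must materialize it too.
        // Objects known on only one path are dropped here for simplicity.
        static AllocState merge(AllocState left, AllocState right) {
            AllocState m = new AllocState();
            for (Map.Entry<String, State> e : left.objects.entrySet()) {
                State r = right.objects.get(e.getKey());
                if (r == null) {
                    continue;
                }
                boolean bothVirtual = e.getValue() == State.VIRTUAL && r == State.VIRTUAL;
                m.objects.put(e.getKey(), bothVirtual ? State.VIRTUAL : State.ESCAPED);
            }
            return m;
        }
    }
}
```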

Summarizing the issues above, we found a third source of bugs besides
inter-procedural parsing and deoptimization: ideal nodes that do not
come directly from bytecode parsing. Since we embed PEA in C2's parser,
we depend on Parse to capture Java object semantics. Item 2 bypasses
the bytecodes of intrinsics, so we fail to capture their side effects,
i.e. escaping points. The if-else ideal nodes of item 3 do not come
from bytecodes either; an invokevirtual or invokeinterface generates
them because of UseTypeProfile. We need to pay more attention to this
area in our next round of bug hunting.

Because hotspot:tier1 is clean, we will focus on the remaining tier1
jtreg tests. In the latest run
(https://github.com/navyxliu/jdk/actions/runs/5827349911), we still
have 9 failures in jdk:tier1 and 109 failures in langtools:tier1. It
looks like C2 PEA has problems dealing with method handles. We have
also started running DaCapo and encountered exceptions in h2 and
lusearch/luindex. We are looking into them.

I would like to continue the discussion of CounterUppercase.java. We
still suffer from the duplicated allocation issue after deploying DFS
materialization. The problem is that we punt object elimination to the
C2 optimizer, and it turns out the C2 optimizer can't easily eliminate
a cyclic object graph (https://bugs.openjdk.org/browse/JDK-8314179).
DFS materialization in Uppercase.java leaves behind a useless object
graph, as shown in the attachment we uploaded. Is there a simple
solution for this case?
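One way to see why a cycle is awkward, in a toy model only (this is not C2's elimination code, and we have not confirmed it is the exact reason C2 keeps the graph in JDK-8314179): a per-allocation "does it still have a live user?" rule never removes either node of a cycle, because each node is stored into the other. A reachability rule, starting from real uses, removes the whole cycle. The graph encoding and method names below are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CyclicDeadAllocs {
    // storedInto.get(n) lists the allocations whose fields hold a reference to n.
    // 'roots' are allocations with real uses (returns, calls, safepoints, ...).

    // Naive rule: repeatedly drop allocations that are neither roots nor stored into a live allocation.
    static Set<String> removeByUseCount(Map<String, List<String>> storedInto, Set<String> roots) {
        Set<String> live = new HashSet<>(storedInto.keySet());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String n : new HashSet<>(live)) {
                boolean used = roots.contains(n);
                for (String holder : storedInto.get(n)) {
                    if (live.contains(holder)) {
                        used = true; // a cycle keeps every member "used"
                        break;
                    }
                }
                if (!used) {
                    live.remove(n);
                    changed = true;
                }
            }
        }
        return live;
    }

    // Reachability rule: an allocation is live only if a root reaches it through field references.
    static Set<String> removeByReachability(Map<String, List<String>> storedInto, Set<String> roots) {
        Map<String, Set<String>> pointsTo = new HashMap<>();
        for (Map.Entry<String, List<String>> e : storedInto.entrySet()) {
            for (String holder : e.getValue()) {
                pointsTo.computeIfAbsent(holder, k -> new HashSet<>()).add(e.getKey());
            }
        }
        Set<String> live = new HashSet<>();
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String n = work.pop();
            if (live.add(n)) {
                work.addAll(pointsTo.getOrDefault(n, Set.of()));
            }
        }
        return live;
    }
}
```

With a.f = b, b.g = a and only an unrelated allocation c used, the naive rule keeps {a, b, c} while the reachability rule keeps only {c}.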

We have 2 ideas on the C2 PEA side. The first is a workaround that
could serve as a short-term remedy: when we detect that we ended up
with a redundant allocation (-XX:+PEAParanoid), we recompile the
current compilation unit without PEA. 'C2Compiler::compile_method'
already has a retry mechanism for a few reasons, e.g. subsume_loads.
The second idea is to bring back passive materialization and take
responsibility for eliminating the original AllocateNode. This is how
Graal PEA works. We have proved that the original object is either
scalar replaceable or useless, so we would mark it in Parse and process
it in the macro-expansion phase. Of course, we are happy to work with
other developers on JDK-8314179. Leaving the heavy lifting to the C2
optimizer is desirable because it keeps C2 PEA simpler.
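The first idea follows the same pattern as the existing retry loop: each failed attempt names one option to turn off, and the unit is compiled again. Below is a rough Java sketch of that control flow only; Option, compileWithRetry and the attempt callback are hypothetical, and this is not a transcription of C2Compiler::compile_method.

```java
import java.util.EnumSet;
import java.util.Optional;
import java.util.function.Function;

class RetrySketch {
    // Hypothetical compile options; only the two relevant to this discussion.
    enum Option { SUBSUME_LOADS, PEA }

    // 'attempt' compiles with the given options and returns empty on success,
    // or the option to disable before retrying (e.g. PEA when the paranoid
    // check finds a redundant allocation).
    static EnumSet<Option> compileWithRetry(Function<EnumSet<Option>, Optional<Option>> attempt) {
        EnumSet<Option> enabled = EnumSet.allOf(Option.class);
        while (true) {
            Optional<Option> retryWithout = attempt.apply(EnumSet.copyOf(enabled));
            if (retryWithout.isEmpty()) {
                return enabled; // options used by the successful attempt
            }
            // Each retry must turn off something that was still on, otherwise we would loop forever.
            if (!enabled.remove(retryWithout.get())) {
                throw new IllegalStateException("retry must make progress: " + retryWithout.get());
            }
        }
    }
}
```

The cost of this design is one extra compilation of the affected unit, which is why we see it only as a short-term remedy.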

The biggest challenge so far is capturing failures of the JIT
compiler. It's like trying to catch a rare Pokémon that is very likely
to escape. I have to spend a lot of time upfront finding a reproducer
and then gradually pinpoint the problematic method. I understand that
this is an inherent issue of a "dynamic compiler". I am told there is
an ongoing project that uses C2 as a PGO static compiler. I wonder
whether it's possible to convert a dynamic compilation into a static
one somehow. If I could train C2 to hit the bug in AOT mode, I guess I
could bisect all compilation units and find the culprit more quickly.

Thanks,
--lx
