RFR: 8320649: C2: Optimize scoped values
Roland Westrelin
roland at openjdk.org
Tue Dec 5 08:41:49 UTC 2023
This change implements C2 optimizations for calls to
ScopedValue.get(). For example, in:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
`v2` can be replaced by `v1` and the second call to `get()` can be
optimized out. That's true whatever code sits between the two calls,
unless a new mapping for `scopedValue` is created in between (when that
happens, no optimization is performed for the method being compiled).
Hoisting a `get()` call out of a loop for a loop-invariant
`scopedValue` should also be legal in most cases.
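For illustration, here is a minimal sketch of the source pattern this
targets (the scoped value and all names are hypothetical, and NAME is
assumed to be bound in the current scope):

static final ScopedValue<String> NAME = ScopedValue.newInstance();

String example() {
    String v1 = NAME.get();
    // ... arbitrary code that does not create a new mapping for NAME ...
    String v2 = NAME.get(); // redundant: v1 can be reused and this call removed
    return v1 + v2;
}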
`ScopedValue.get()` is implemented in Java code as a 2 step process. A
cache is attached to the current thread object. If the `ScopedValue`
object is in the cache, then the result of `get()` is read from
there. Otherwise, a slow call is performed that also inserts the
mapping in the cache. The cache itself is lazily allocated. One
`ScopedValue` can hash to 2 different indexes in the cache. On a
cache probe, both indexes are checked. As a consequence, probing the
cache is a multi-step process (check that the cache is present, check
the first index, check the second index if the first one
failed). If the cache is populated early on, then when the method that
calls `ScopedValue.get()` is compiled, profiling reports the slow path
as never taken and only the read from the cache is compiled.
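In rough pseudo code (a sketch, not the actual library implementation;
the field and index names are made up), the fast path looks like:

cache = currentThread.scopedValueCache; // lazily allocated
if (cache != null) {
    // a ScopedValue hashes to 2 possible slots; both are checked
    if (cache[index1] == scopedValue) return valueAt(index1);
    if (cache[index2] == scopedValue) return valueAt(index2);
}
return slowGet(); // slow call; also inserts the mapping in the cache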
To perform the optimizations, I added 3 new node types to C2:
- the pair
ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
the cache probe
- a cfg node ScopedValueGetResultNode to help locate the result of the
`get()` call in the IR graph.
In pseudo code, once the nodes are inserted, the code of a `get()` is:
hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
    res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
    res = ..; // slow call, possibly inlined. The subgraph can be arbitrarily complex.
}
res = ScopedValueGetResult(res)
In the snippet:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
Replacing `v2` by `v1` is then done by starting from the
`ScopedValueGetResult` node for the second `get()` and looking for a
dominating `ScopedValueGetResult` for the same `ScopedValue`
object. When one is found, it is used as a replacement. Eliminating
the second `get()` call is achieved by making
`ScopedValueGetHitsInCache` always successful if there's a dominating
`ScopedValueGetResult` and replacing its companion
`ScopedValueGetLoadFromCache` by the dominating
`ScopedValueGetResult`.
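In the notation above, the net effect is roughly (a sketch):

// first get()
hits1 = ScopedValueGetHitsInCache(scopedValue)
if (hits1) {
    r1 = ScopedValueGetLoadFromCache(hits1);
} else {
    r1 = ..; // slow path
}
v1 = ScopedValueGetResult(r1)
...
// second get(): its ScopedValueGetResult finds the dominating v1;
// ScopedValueGetHitsInCache is made always successful and its companion
// ScopedValueGetLoadFromCache is replaced by v1, so the second get()
// collapses to:
v2 = v1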
Hoisting a `get()` out of a loop is achieved by peeling one iteration
of the loop. The optimization above then finds a dominating `get()`
(the one in the peeled iteration) and removes the `get()` from the
loop body.
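Sketched on the source-level pattern:

// before: scopedValue is loop invariant
while (cond) {
    v = scopedValue.get();
    ...
}

// after peeling one iteration
if (cond) {
    v = scopedValue.get();          // peeled iteration
    ...
    while (cond) {
        v = scopedValue.get();      // dominated by the peeled get(), removed
        ...
    }
}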
An important case, I think, is when profiling predicts the slow path
as never taken. The code of `get()` is then:
hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
    res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
    trap();
}
res = ScopedValueGetResult(res)
The `ScopedValueGetResult` doesn't help there and is removed early
on. The optimization process then looks for a pair of
`ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` that
dominates the current pair and can replace it. In that case, hoisting
a `ScopedValue.get()` out of a loop can be done by loop predication,
and I added special logic to predication for that.
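Roughly, with the slow path compiled as a trap (sketch):

// before: the cache probe sits in the loop body
while (cond) {
    hits = ScopedValueGetHitsInCache(scopedValue)
    if (hits) {
        v = ScopedValueGetLoadFromCache(hits);
    } else {
        trap();
    }
    ...
}

// after predication: the probe is hoisted above the loop
hits = ScopedValueGetHitsInCache(scopedValue)
if (hits) {
    v = ScopedValueGetLoadFromCache(hits);
} else {
    trap();
}
while (cond) {
    // v is reused in the loop body
    ...
}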
Adding the new nodes to the graph when a `ScopedValue.get()` call is
encountered is done in several steps:
1- Inlining of `ScopedValue.get()` is delayed and the call is enqueued
for late inlining.
2- Once the graph is fully constructed, for each call to
`ScopedValue.get()`, a `ScopedValueGetResult` is added between the
result of the call and its uses (sketched below).
3- The call is then inlined by parsing the `ScopedValue.get()` method.
4- Finally, the subgraph that results is pattern matched and the pieces
required to perform the cache probe are extracted and attached to new
`ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` nodes.
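As a sketch of what step 2 produces (the shape after steps 3 and 4 is
the one shown earlier):

res = <call to ScopedValue.get()>
res' = ScopedValueGetResult(res)
// all former uses of res now consume res'; the call is then inlined
// (step 3) and the resulting subgraph is pattern matched into the
// ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache pair (step 4)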
There are a couple of reasons for steps 3 and 4:
- As mentioned above, probing the cache is a multi-step process. Having
only 2 nodes in a simple graph shape to represent it makes it easier
to write robust optimizations.
- The subgraph for the method after parsing contains valuable
information: profile data that captures which of the 2 locations
in the cache is the most likely to cause a hit. That profile data is
attached to the new nodes.
Removal of redundant nodes is done during loop opts. The `ScopedValue`
nodes are then expanded. That also happens during loop opts because,
once expansion is over, there are opportunities for further
optimizations/clean up that can only happen during loop opts. During
expansion, `ScopedValueGetResult` nodes are removed and
`ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` are expanded
into the multi-step process of probing the cache. Profile data attached
to the nodes is used to assign correct frequencies/counts to the If
nodes. Of the 2 locations in the cache that are tested, the one that's
the most likely to see a hit (according to profile data) is tested first.
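In pseudo code, the expanded probe looks roughly like this (a sketch;
assuming profile data says the first location is the more likely one):

if (cache != null) {
    if (cache[likely_index] == scopedValue) {
        res = <load value at likely_index>;
    } else if (cache[other_index] == scopedValue) {
        res = <load value at other_index>;
    } else {
        res = ..; // slow path (or trap, depending on profile)
    }
} else {
    res = ..; // slow path (or trap)
}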
-------------
Commit messages:
- white spaces + bug id in test
- test & fix
Changes: https://git.openjdk.org/jdk/pull/16966/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8320649
Stats: 2077 lines in 33 files changed: 2047 ins; 1 del; 29 mod
Patch: https://git.openjdk.org/jdk/pull/16966.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
PR: https://git.openjdk.org/jdk/pull/16966