RFR: 8320649: C2: Optimize scoped values
Roland Westrelin
roland at openjdk.org
Tue Dec 5 08:41:49 UTC 2023
This change implements C2 optimizations for calls to
ScopedValue.get(). For example, in:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
`v2` can be replaced by `v1` and the second call to `get()` can be
optimized out. That's true whatever code sits between the two calls,
unless a new mapping for `scopedValue` is created in between (when that
happens, no optimization is performed for the method being compiled).
Hoisting a `get()` call out of a loop for a loop-invariant
`scopedValue` should also be legal in most cases.
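For illustration, here is a minimal sketch of the source pattern this
targets (the scoped value and all names are hypothetical, and NAME is
assumed to be bound in the current scope):

static final ScopedValue<String> NAME = ScopedValue.newInstance();

String example() {
    String v1 = NAME.get();
    // ... arbitrary code that does not create a new mapping for NAME ...
    String v2 = NAME.get(); // redundant: v1 can be reused and this call removed
    return v1 + v2;
}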
`ScopedValue.get()` is implemented in Java code as a 2 step process. A
cache is attached to the current thread object. If the `ScopedValue`
object is in the cache, then the result of `get()` is read from
there. Otherwise, a slow call is performed that also inserts the
mapping in the cache. The cache itself is lazily allocated. One
`ScopedValue` can hash to 2 different indexes in the cache. On a
cache probe, both indexes are checked. As a consequence, probing the
cache is a multi-step process (check that the cache is present, check
the first index, check the second index if the first one
failed). If the cache is populated early on, then when the method that
calls `ScopedValue.get()` is compiled, profiling reports the slow path
as never taken and only the read from the cache is compiled.
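In rough pseudo code (a sketch, not the actual library implementation;
the field and index names are made up), the fast path looks like:

cache = currentThread.scopedValueCache; // lazily allocated
if (cache != null) {
    // a ScopedValue hashes to 2 possible slots; both are checked
    if (cache[index1] == scopedValue) return valueAt(index1);
    if (cache[index2] == scopedValue) return valueAt(index2);
}
return slowGet(); // slow call; also inserts the mapping in the cache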
To perform the optimizations, I added 3 new node types to C2:
- the pair
ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
the cache probe
- a cfg node ScopedValueGetResultNode to help locate the result of the
`get()` call in the IR graph.
In pseudo code, once the nodes are inserted, the code of a `get()` is:
hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
    res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
    res = ..; // slow call, possibly inlined. The subgraph can be arbitrarily complex.
}
res = ScopedValueGetResult(res)
In the snippet:
v1 = scopedValue.get();
...
v2 = scopedValue.get();
Replacing `v2` by `v1` is then done by starting from the
`ScopedValueGetResult` node for the second `get()` and looking for a
dominating `ScopedValueGetResult` for the same `ScopedValue`
object. When one is found, it is used as a replacement. Eliminating
the second `get()` call is achieved by making
`ScopedValueGetHitsInCache` always successful if there's a dominating
`ScopedValueGetResult` and replacing its companion
`ScopedValueGetLoadFromCache` by the dominating
`ScopedValueGetResult`.
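In the notation above, the net effect is roughly (a sketch):

// first get()
hits1 = ScopedValueGetHitsInCache(scopedValue)
if (hits1) {
    r1 = ScopedValueGetLoadFromCache(hits1);
} else {
    r1 = ..; // slow path
}
v1 = ScopedValueGetResult(r1)
...
// second get(): its ScopedValueGetResult finds the dominating v1;
// ScopedValueGetHitsInCache is made always successful and its companion
// ScopedValueGetLoadFromCache is replaced by v1, so the second get()
// collapses to:
v2 = v1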
Hoisting a `get()` out of a loop is achieved by peeling one iteration
of the loop. The optimization above then finds a dominating `get()`
(the one in the peeled iteration) and removes the `get()` from the
loop body.
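Sketched on the source-level pattern:

// before: scopedValue is loop invariant
while (cond) {
    v = scopedValue.get();
    ...
}

// after peeling one iteration
if (cond) {
    v = scopedValue.get();          // peeled iteration
    ...
    while (cond) {
        v = scopedValue.get();      // dominated by the peeled get(), removed
        ...
    }
}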
An important case, I think, is when profiling predicts the slow path
as never taken. The code of `get()` is then:
hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
if (hits_in_the_cache) {
    res = ScopedValueGetLoadFromCache(hits_in_the_cache);
} else {
    trap();
}
res = ScopedValueGetResult(res)
The `ScopedValueGetResult` doesn't help there and is removed early
on. The optimization process then looks for a pair of
`ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` that
dominates the current pair and can replace it. In that case, hoisting
a `ScopedValue.get()` out of a loop can be done by loop predication,
and I added special logic to predication for that.
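Roughly, with the slow path compiled as a trap (sketch):

// before: the cache probe sits in the loop body
while (cond) {
    hits = ScopedValueGetHitsInCache(scopedValue)
    if (hits) {
        v = ScopedValueGetLoadFromCache(hits);
    } else {
        trap();
    }
    ...
}

// after predication: the probe is hoisted above the loop
hits = ScopedValueGetHitsInCache(scopedValue)
if (hits) {
    v = ScopedValueGetLoadFromCache(hits);
} else {
    trap();
}
while (cond) {
    // v is reused in the loop body
    ...
}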
Adding the new nodes to the graph when a `ScopedValue.get()` call is
encountered is done in several steps:
1- Inlining of `ScopedValue.get()` is delayed and the call is enqueued
for late inlining.
2- Once the graph is fully constructed, for each call to
`ScopedValue.get()`, a `ScopedValueGetResult` is added between the
result of the call and its uses (sketched below).
3- The call is then inlined by parsing the `ScopedValue.get()` method.
4- Finally, the subgraph that results is pattern matched and the pieces
required to perform the cache probe are extracted and attached to new
`ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` nodes.
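As a sketch of what step 2 produces (the shape after steps 3 and 4 is
the one shown earlier):

res = <call to ScopedValue.get()>
res' = ScopedValueGetResult(res)
// all former uses of res now consume res'; the call is then inlined
// (step 3) and the resulting subgraph is pattern matched into the
// ScopedValueGetHitsInCache/ScopedValueGetLoadFromCache pair (step 4)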
There are a couple of reasons for steps 3 and 4:
- As mentioned above, probing the cache is a multi-step process. Having
only 2 nodes in a simple graph shape to represent it makes it easier
to write robust optimizations.
- The subgraph for the method after parsing contains valuable
information: profile data that captures which of the 2 locations
in the cache is the most likely to cause a hit. That profile data is
attached to the new nodes.
Removal of redundant nodes is done during loop opts. The `ScopedValue`
nodes are then expanded. That also happens during loop opts because,
once expansion is over, there are opportunities for further
optimizations/clean up that can only happen during loop opts. During
expansion, `ScopedValueGetResult` nodes are removed and
`ScopedValueGetHitsInCache`/`ScopedValueGetLoadFromCache` are expanded
into the multi-step process of probing the cache. Profile data attached
to the nodes is used to assign correct frequencies/counts to the If
nodes. Of the 2 locations in the cache that are tested, the one that's
the most likely to see a hit (according to profile data) is tested first.
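In pseudo code, the expanded probe looks roughly like this (a sketch;
assuming profile data says the first location is the more likely one):

if (cache != null) {
    if (cache[likely_index] == scopedValue) {
        res = <load value at likely_index>;
    } else if (cache[other_index] == scopedValue) {
        res = <load value at other_index>;
    } else {
        res = ..; // slow path (or trap, depending on profile)
    }
} else {
    res = ..; // slow path (or trap)
}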
-------------
Commit messages:
- white spaces + bug id in test
- test & fix
Changes: https://git.openjdk.org/jdk/pull/16966/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=16966&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8320649
Stats: 2077 lines in 33 files changed: 2047 ins; 1 del; 29 mod
Patch: https://git.openjdk.org/jdk/pull/16966.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/16966/head:pull/16966
PR: https://git.openjdk.org/jdk/pull/16966