RFR: 8320649: C2: Optimize scoped values [v18]
Roland Westrelin
roland at openjdk.org
Thu May 23 16:39:14 UTC 2024
On Thu, 2 May 2024 14:54:17 GMT, Roland Westrelin <roland at openjdk.org> wrote:
>> This change implements C2 optimizations for calls to
>> ScopedValue.get(). Indeed, in:
>>
>>
>> v1 = scopedValue.get();
>> ...
>> v2 = scopedValue.get();
>>
>>
>> `v2` can be replaced by `v1` and the second call to `get()` can be
>> optimized out. That's true whatever code sits between the 2 calls unless a
>> new mapping for `scopedValue` is created in between (when that happens,
>> no optimization is performed for the method being compiled). Hoisting
>> a `get()` call out of a loop for a loop invariant `scopedValue` should
>> also be legal in most cases.
>>
>> `ScopedValue.get()` is implemented in Java code as a 2 step process. A
>> cache is attached to the current thread object. If the `ScopedValue`
>> object is in the cache then the result from `get()` is read from
>> there. Otherwise a slow call is performed that also inserts the
>> mapping in the cache. The cache itself is lazily allocated. One
>> `ScopedValue` can be hashed to 2 different indexes in the cache. On a
>> cache probe, both indexes are checked. As a consequence, probing the
>> cache is a multi step process (check whether the cache is present,
>> check the first index, check the second index if the first one
>> failed). If the cache is populated early on, then when the method that
>> calls `ScopedValue.get()` is compiled, the profile reports the slow
>> path as never taken and only the read from the cache is compiled.
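>>
>> To make the probe concrete, here is a hedged sketch of the Java-level
>> fast path (not the exact JDK source; `scopedValueCache()`,
>> `firstSlot()`, `secondSlot()` and `slowGet()` are illustrative
>> helpers):
>>
>>
>> @SuppressWarnings("unchecked")
>> public T get() {
>>     Object[] cache = scopedValueCache();  // per-thread cache, may be null (lazily allocated)
>>     if (cache != null) {
>>         int i = firstSlot();              // first of the 2 possible (key, value) slots
>>         if (cache[i] == this) {
>>             return (T) cache[i + 1];      // hit at the first index
>>         }
>>         i = secondSlot();                 // second possible slot
>>         if (cache[i] == this) {
>>             return (T) cache[i + 1];      // hit at the second index
>>         }
>>     }
>>     return slowGet();                     // slow call that also inserts the mapping in the cache
>> }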
>>
>> To perform the optimizations, I added 3 new node types to C2:
>>
>> - the pair
>> ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
>> the cache probe
>>
>> - a cfg node ScopedValueGetResultNode to help locate the result of the
>> `get()` call in the IR graph.
>>
>> In pseudo code, once the nodes are inserted, the code of a `get()` is:
>>
>>
>> hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
>> if (hits_in_the_cache) {
>>     res = ScopedValueGetLoadFromCache(hits_in_the_cache);
>> } else {
>>     res = ..; // slow call, possibly inlined. The subgraph can be arbitrarily complex
>> }
>> res = ScopedValueGetResult(res)
>>
>>
>> In the snippet:
>>
>>
>> v1 = scopedValue.get();
>> ...
>> v2 = scopedValue.get();
>>
>>
>> Replacing `v2` by `v1` is then done by starting from the
>> `ScopedValueGetResult` node for the second `get()` and looking for a
>> dominating `ScopedValueGetResult` for the same `ScopedValue`
>> object. When one is found, it is used as a replacement. Eliminating
>> the second `get()` call is achieved by making
>> `ScopedValueGetHitsInCache` always successful if there's a dominating
>> `Scoped...
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
>
> whitespaces
There are 3 optimizations that this patch performs:
1- replace a `ScopedValue.get()` by the result of a dominating `ScopedValue.get()`
2- move a loop invariant `ScopedValue.get()` out of the loop (see the sketch after this list)
3- streamline the code emitted for `ScopedValue.get()`
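As an illustration of 1- and 2-, here is a hedged source-level sketch of the effect (the transformations are applied to C2's IR, not to source code; the class, field and method names are made up for the example):

class Example {
    // java.lang.ScopedValue (a preview API at the time of this PR)
    static final ScopedValue<Integer> SV = ScopedValue.newInstance();

    // Assuming SV was bound by ScopedValue.where(SV, ...) further up the
    // call stack, SV.get() is loop invariant here: the cache probe can be
    // hoisted before the loop and executed once, and every later SV.get()
    // dominated by it can reuse its result (as long as no new mapping for
    // SV is created in between).
    static int sum(int[] a) {
        int s = 0;
        for (int i = 0; i < a.length; i++) {
            s += a[i] * SV.get();
        }
        return s;
    }
}
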
There are 2 `ScopedValue.get()` patterns that are handled:
1- when the profile reports that the slow path that updates the cache is not taken: `ScopedValue.get()` always hits in the cache
2- when the profile reports that the slow path is taken and the code that updates the cache is included in the compiled code
Obviously, before `ScopedValue.get()` can use the cache, the cache has to be updated, so the slow path is taken at some point. But because HotSpot doesn't profile the first invocations of a method, profile data can report the slow path as never taken. That's likely what happens with a simple micro benchmark. Also, there can be multiple `ScopedValue`s and the profile can still report the slow path as not taken. It's more a matter of when the cache is updated than of how many `ScopedValue`s there are.
The patch performs optimizations 1-, 2- and 3- for patterns 1- and 2- but it does it better for pattern 1- than for 2-. If the slow path is included in the compiled code, then only a `ScopedValue.get` call that dominates the backedge of a loop is hoisted out of the loop. That's because hoisting is a 2 step process in the case of pattern 2-: peel one iteration of the loop and replace the `ScopedValue.get` in the loop with the one from the peeled iteration. When the slow path is compiled in the method, a lot of extra code is also included and that likely disrupts other optimizations that might be needed before `ScopedValue.get` can be optimized. For instance, the slow path likely comes with a non-inlined call that could get in the way of memory subgraph optimizations. That's why I made sure the patch handles both patterns.
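To make the 2 step process for pattern 2- concrete, here is a hedged before/after sketch at the source level (the peeling and replacement actually happen on C2's IR; `use()` is a made-up helper):

// before: sv.get() dominates the backedge; the slow path is compiled in the loop body
static void before(ScopedValue<Integer> sv, int n) {
    for (int i = 0; i < n; i++) {
        use(sv.get());      // every iteration probes the cache and may take the slow path
    }
}

// after peeling one iteration: the peeled sv.get() dominates the loop,
// so the sv.get() left in the loop is replaced by its result
static void after(ScopedValue<Integer> sv, int n) {
    if (n > 0) {
        int v = sv.get();   // the only remaining get(), in the peeled iteration
        use(v);
        for (int i = 1; i < n; i++) {
            use(v);         // reuses the result of the dominating get()
        }
    }
}

static void use(int v) { /* ... */ }
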
I thought about always speculating initially that the slow path is not taken when compiling `ScopedValue.get`, or trying to find some way to work around profile pollution. I also thought about cleverer ways of optimizing pattern 2-. But the patch felt complicated enough and, when @theRealAph experimented with it, he reported that it was doing OK the way it is.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/16966#issuecomment-2127601116