RFR: 8320649: C2: Optimize scoped values [v4]

Wed Jan 17 14:44:00 UTC 2024

On Thu, 4 Jan 2024 08:12:22 GMT, Roland Westrelin <roland at openjdk.org> wrote:

>> This change implements C2 optimizations for calls to
>> ScopedValue.get(). Indeed, in:
>> 
>> 
>> v1 = scopedValue.get();
>> ...
>> v2 = scopedValue.get();
>> 
>> 
>> `v2` can be replaced by `v1` and the second call to `get()` can be
>> optimized out. That's true whatever is between the 2 calls unless a
>> new mapping for `scopedValue` is created in between (when that happens
>> no optimizations is performed for the method being compiled). Hoisting
>> a `get()` call out of loop for a loop invariant `scopedValue` should
>> also be legal in most cases.
>> 
>> `ScopedValue.get()` is implemented in java code as a 2 step process. A
>> cache is attached to the current thread object. If the `ScopedValue`
>> object is in the cache then the result from `get()` is read from
>> there. Otherwise a slow call is performed that also inserts the
>> mapping in the cache. The cache itself is lazily allocated. One
>> `ScopedValue` can be hashed to 2 different indexes in the cache. On a
>> cache probe, both indexes are checked. As a consequence, the process
>> of probing the cache is a multi step process (check if the cache is
>> present, check first index, check second index if first index
>> failed). If the cache is populated early on, then when the method that
>> calls `ScopedValue.get()` is compiled, profile reports the slow path
>> as never taken and only the read from the cache is compiled.
>> 
>> To perform the optimizations, I added 3 new node types to C2:
>> 
>> - the pair
>>   ScopedValueGetHitsInCacheNode/ScopedValueGetLoadFromCacheNode for
>>   the cache probe
>>   
>> - a cfg node ScopedValueGetResultNode to help locate the result of the
>>   `get()` call in the IR graph.
>> 
>> In pseudo code, once the nodes are inserted, the code of a `get()` is:
>> 
>> 
>> hits_in_the_cache = ScopedValueGetHitsInCache(scopedValue)
>> if (hits_in_the_cache) {
>>   res = ScopedValueGetLoadFromCache(hits_in_the_cache);
>> } else {
>>   res = ..; //slow call possibly inlined. Subgraph can be arbitray complex
>> }
>> res = ScopedValueGetResult(res)
>> 
>> 
>> In the snippet:
>> 
>> 
>> v1 = scopedValue.get();
>> ...
>> v2 = scopedValue.get();
>> 
>> 
>> Replacing `v2` by `v1` is then done by starting from the
>> `ScopedValueGetResult` node for the second `get()` and looking for a
>> dominating `ScopedValueGetResult` for the same `ScopedValue`
>> object. When one is found, it is used as a replacement. Eliminating
>> the second `get()` call is achieved by making
>> `ScopedValueGetHitsInCache` always successful if there's a dominating
>> `Scoped...
>
> Roland Westrelin has updated the pull request incrementally with one additional commit since the last revision:
> 
>   merge fix

Second badge of comments.

src/hotspot/share/classfile/vmIntrinsics.hpp line 301:

> 299:    do_name(     slowGet_name,                                    "slowGet")                                             \
> 300:   do_intrinsic(_SVCacheInvalidate,       java_lang_ScopedValue_Cache, invalidate_name, int_void_signature, F_S)         \
> 301:    do_name(     invalidate_name,                                 "invalidate")                                          \

What prevents us from writing out `_scopedValueGet` etc, just like the other names here?

src/hotspot/share/opto/callGenerator.cpp line 1051:

> 1049:     Node* mem = scoped_value_cache->in(TypeFunc::Memory);
> 1050:     Node* c = scoped_value_cache->in(TypeFunc::Control);
> 1051:     Node* io = scoped_value_cache->in(TypeFunc::I_O);

I would use `input_mem`, `input_ctrl`, `input_io`. Then the replacements below would read more intuitively.

src/hotspot/share/opto/callGenerator.cpp line 1058:

> 1056: 
> 1057:     // remove the scopedValueCache() call
> 1058:     CallProjections scoped_value_cache_projs = CallProjections();

Suggestion:

    CallProjections scoped_value_cache_projs;

Is the assignment really necessary, or style-wise preferrable? I see you use it without elsewhere.

src/hotspot/share/opto/callGenerator.cpp line 1089:

> 1087:                                                                                         second_index == nullptr ? C->top() : second_index);
> 1088: 
> 1089:     // It will later be expanded back to all the checks so record profile data

Should we also copy the node info (e.g. line number etc)?

src/hotspot/share/opto/callGenerator.cpp line 1115:

> 1113:     } else {
> 1114:       sv_hits_in_cache->set_profile_data(2, 0, 0);
> 1115:     }

Another case of code duplication. Why not write a method that extracts `cnt, prob` for an `iff`?`Or maybe that already exists?

src/hotspot/share/opto/callGenerator.cpp line 1120:

> 1118: 
> 1119:     // And compute the probability of a miss in the cache
> 1120:     float prob;

Suggestion:

    float cache_miss_prob;

src/hotspot/share/opto/callGenerator.cpp line 1122:

> 1120:     float prob;
> 1121:     // get_cache_prob: probability that cache array is not null
> 1122:     // get_first_prob: probability of a miss

`get_first_prob` sounds like the probability of "getting it", so not a miss. Which is it?

src/hotspot/share/opto/callGenerator.cpp line 1127:

> 1125:       prob = PROB_UNKNOWN;
> 1126:     } else {
> 1127:       prob = (1 - get_cache_prob) + get_cache_prob * (get_first_prob + (1 - get_first_prob) * get_second_prob);

Suggestion:

      cache_miss_prob = (1 - get_cache_prob) + // cache array is null
                      get_cache_prob * ( // cache array not null, and:
                         get_first_prob + // first has cache miss
                         (1 - get_first_prob) * get_second_prob // first hits, but second misses
                       );

src/hotspot/share/opto/callGenerator.cpp line 1139:

> 1137: 
> 1138:     // Merge the paths that produce the result (in case there's a slow path)
> 1139:     Node* r = new RegionNode(3);

Suggestion:

    Node* region_fast_slow = new RegionNode(3);

I think we can affort a slightly more expressive name.

src/hotspot/share/opto/callGenerator.cpp line 1151:

> 1149:       phi_cache_value->init_req(1, C->top());
> 1150:       phi_mem->init_req(1, C->top());
> 1151:       phi_io->init_req(1, C->top());

Why not just put the slow path on input `2`, and make the size of the RegionNode depend on if there is a `slow_call`? Then you can avoid these top inputs, right?

src/hotspot/share/opto/callGenerator.cpp line 1155:

> 1153:       CallProjections slow_projs;
> 1154:       slow_call->extract_projections(&slow_projs, false);
> 1155:       Node* fallthrough = slow_projs.fallthrough_catchproj->clone();

Why does that have to be cloned?

src/hotspot/share/opto/callGenerator.cpp line 1166:

> 1164: 
> 1165:     // ScopedValueGetLoadFromCache is a single that represents the result of a hit in the cache
> 1166:     Node* cache_value = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));

Suggestion:

    Node* sv_load_from_cache = kit.gvn().transform(new ScopedValueGetLoadFromCacheNode(C, in_cache, sv_hits_in_cache));

For consistency with `sv_hits_in_cache` and the node class name.

src/hotspot/share/opto/cfgnode.hpp line 737:

> 735:   ProjNode* result_out() {
> 736:     return proj_out_or_null(Result);
> 737:   }

Either verify that we have not null, or else rename to `result_out_or_null`.

src/hotspot/share/opto/compile.cpp line 465:

> 463:   remove_useless_late_inlines(         &_boxing_late_inlines, useful);
> 464:   remove_useless_late_inlines(&_vector_reboxing_late_inlines, useful);
> 465:   remove_useless_late_inlines(          &_scoped_value_late_inlines, useful);

Suggestion:

  remove_useless_late_inlines(   &_scoped_value_late_inlines, useful);

src/hotspot/share/opto/compile.cpp line 2034:

> 2032: 
> 2033: void Compile::inline_scoped_value_calls(PhaseIterGVN& igvn) {
> 2034:   if (_scoped_value_late_inlines.length() > 0) {

Rather than indenting everything, I would just check `_scoped_value_late_inlines.is_empty()` and return.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16966#pullrequestreview-1827249142
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455708715
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455535612
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455507963
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455600271
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455631389
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455639533
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455649353
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455651376
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455669565
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455678033
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455683828
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455692682
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455721814
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455724546
PR Review Comment: https://git.openjdk.org/jdk/pull/16966#discussion_r1455742818