RFR: 8345485: C2 MergeLoads: merge adjacent array/native memory loads into larger load [v4]
Emanuel Peter
epeter at openjdk.org
Mon Jun 16 07:10:41 UTC 2025
On Fri, 13 Jun 2025 08:21:51 GMT, Kuai Wei <kwei at openjdk.org> wrote:
>> @kuaiwei I'm struggling to follow your pseudocode; can you please expand a little or try to describe it in other words?
>>
>> It is good to reduce redundant work, but it has to be worth the complexity. I'm also skeptical of caching things between IGVN optimizations; it is just not something we do, as far as I know. Caching could also be tricky when the cached things are no longer accurate. What exactly would be your approach with caching?
>
> @eme64 I added some detail to the pseudo code; I hope it explains my design.
> A difficult part of the optimization is that we need to find the right combine operator that can be replaced with the merged load, especially when it has successor combine operators. We need to decide which one is the better candidate.
> For a given combine operator, `MergePrimitiveLoad::run()` works like this:
>
> ```c++
> void MergePrimitiveLoad::run(AddNode* _check) {
>   // go down through the unique output of _check to collect successor combine operators
>   GrowableArray<AddNode*> rest_combine_operators = collect_successor_combine_operators(_check)
>
>   if (!rest_combine_operators.is_empty()) {
>     for (AddNode* op in rest_combine_operators) {
>       // collect mergeable load info list from this combine operator
>       merge_list = collect_merge_info_list(op)
>       if (_check is in merge_list) {
>         // we will not optimize _check, it will be merged when op is optimized
>         return;
>       }
>     }
>   }
>
>   // all successor combine operators are checked; we can start to optimize the given _check operator
>   ...
> }
>
> // For a given combine operator, collect the mergeable load info list. Every item
> // in the result list is a tuple of (Load, combine, shift_value).
> // It's similar to MergePrimitive::collect_merge_list() in the previous patch; more detail can be found in the code.
>
> GrowableArray<MergeLoadInfo> collect_merge_info_list(AddNode* combine) {
>   // get the load from the input of combine
>   LoadNode* load = get_load_from_input(combine);
>   if (load == nullptr) return empty list;
>
>   MemNode* mem = memory input of load;
>
>   for every output of mem {
>     if (output->isa_load()) {
>       // make a merge info for this load node
>       info = merge_load_info(output);
>       if (!info->is_invalid()) {
>         append info into result list
>       }
>     }
>   }
>
>   // check the merged bytes and whether they are adjacent
>   ...
>
>   return result list;
> }
> ```
@kuaiwei Thanks for the additional detail, that was helpful!
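For other readers following the thread: the source-level shape we are discussing is, roughly, the classic byte-combine idiom sketched below. The function is purely illustrative and not part of the patch; each `|` plays the role of one "combine operator" in the pseudocode above, and the outermost one sees the whole chain.

```c++
#include <cstdint>

// Illustrative only (not from the patch): four adjacent byte loads combined
// into one 32-bit value. The MergeLoads optimization aims to replace such a
// chain of loads, shifts and ors with a single wider load where the platform
// endianness allows it.
uint32_t combine_bytes(const uint8_t* buf) {
  return  (uint32_t)buf[0]
        | ((uint32_t)buf[1] << 8)
        | ((uint32_t)buf[2] << 16)
        | ((uint32_t)buf[3] << 24);
}
```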
To me, it seems that the check below is almost too rigorous, and hence might be more expensive than necessary. That is also what you were worried about, right? Some checks would be done over and over again, which is why you wanted to cache something. Correct?
```c++
for (AddNode* op in rest_combine_operators) {
  // collect mergeable load info list from this combine operator
  merge_list = collect_merge_info_list(op)
  if (_check is in merge_list) {
    // we will not optimize _check, it will be merged when op is optimized
    return;
  }
}
```
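To make the cost concern concrete, here is a rough back-of-the-envelope count (my own illustration, with assumed numbers): if N mergeable loads hang off one memory state and a candidate has on the order of N successor combine operators above it, then rebuilding the merge list for every successor revisits the same loads about N times, instead of doing a constant-size check per successor.

```c++
#include <cstdio>

// Rough, illustrative cost model only: load visits per candidate if every
// successor combine operator triggers a full walk over all N loads that
// hang off the shared memory state.
int main() {
  const int sizes[] = {4, 8, 16};
  for (int n : sizes) {
    std::printf("N = %2d loads -> roughly %3d load visits per candidate\n",
                n, n * (n - 1));
  }
  return 0;
}
```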
Can you say why you need to do all the traversals you do in `collect_merge_info_list`? Why do you need to check all the other uses of the mem-input of the load there?
I was wondering whether it would not be sufficient to just check that the structure on the other side of `op` has the correct shift and load address offset, so that it could be merged into the merge list. Do you know what I mean?
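To sketch what I mean (a toy model of my own; the struct, fields and function below are invented for illustration and are not HotSpot code): instead of rebuilding the whole merge list for every successor `op`, it might be enough to check locally whether the load on the other side of `op` lines up with the chain we already have, i.e. its address offset and shift advance by exactly one load.

```c++
// Toy model, illustrative only: what one mergeable load contributes to a
// combine chain (assuming the little-endian-style pattern where higher
// address offsets get larger shifts).
struct MergeCandidate {
  long offset;     // address offset of the load relative to the common base
  int  shift;      // left-shift applied before the combine operator
  int  load_size;  // bytes loaded (1 for LoadB/LoadUB, 2 for LoadS/LoadUS, ...)
};

// True if 'next' (the load behind the successor combine operator) extends
// 'last' by exactly one adjacent load: the offset advances by the load size
// and the shift advances by 8 * load_size bits.
bool extends_chain(const MergeCandidate& last, const MergeCandidate& next) {
  return next.load_size == last.load_size
      && next.offset    == last.offset + last.load_size
      && next.shift     == last.shift  + 8 * last.load_size;
}
```

If a local check like this fails, `_check` presumably cannot be subsumed by `op` anyway, so the full `collect_merge_info_list(op)` walk would not be needed for that decision.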
-------------
PR Comment: https://git.openjdk.org/jdk/pull/24023#issuecomment-2975347439