RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Feb 11 22:22:34 UTC 2022


>>> For this simple case we can teach C2's IGVN to split field loads 
>>> through the Phi so that phi(p0, p1) is no longer used, and neither are 
>>> the allocations. We can do that because we know that the allocations 
>>> and the phi do not escape.
>>
>> Seems like split_unique_types() pass is a prerequisite to make such 
>> transformations predictable on a larger scale. It simplifies the 
>> memory graph and untangles instance-related memory operations from the 
>> rest of the graph.
> 
> Can you elaborate on your suggestion?
> 
> As you correctly pointed out in your original post, EA and SR heavily 
> depend on _instance_id, which is the Allocation node's _idx value. The 
> solution for this is either to remove the reference to the merge phi, or 
> to introduce a new "virtual allocation" whose fields are populated by 
> loads - but that again needs splitting through the phi.

My rough proposal is to group interfering non-escaping allocations into 
"clusters" and introduce separate memory slices for such clusters (as 
part of split_unique_types()). The clusters should not intersect; 
otherwise, the slice won't be unique anymore.

Once the memory graph for a unique alias is separated, it should be much 
simpler to optimize it further. If all memory operations are folded, the 
allocation can go away.
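
A minimal sketch (plain Java, not C2 code; all names hypothetical) of how such clusters could be formed: allocations that meet at the same phi interfere, so the clusters are the connected components of allocations joined by phi merges, computed here with union-find.

```java
// Hypothetical sketch, NOT C2 code: clusters of interfering non-escaping
// allocations are the connected components of allocations joined by phis,
// computed with union-find. Each resulting cluster would then receive its
// own memory slice in split_unique_types().
public class AllocationClusters {
    final int[] parent;                       // one entry per allocation node

    AllocationClusters(int numAllocations) {
        parent = new int[numAllocations];
        for (int i = 0; i < numAllocations; i++) parent[i] = i;
    }

    int find(int a) {                         // root with path compression
        return parent[a] == a ? a : (parent[a] = find(parent[a]));
    }

    void mergeAtPhi(int... phiInputs) {       // allocations merged by one phi
        for (int i = 1; i < phiInputs.length; i++)
            parent[find(phiInputs[i])] = find(phiInputs[0]);
    }

    public static void main(String[] args) {
        AllocationClusters c = new AllocationClusters(4);
        c.mergeAtPhi(0, 1);   // phi(p0, p1): p0 and p1 interfere
        c.mergeAtPhi(2, 3);   // phi(p2, p3): a separate, non-intersecting cluster
        System.out.println(c.find(0) == c.find(1));   // same cluster
        System.out.println(c.find(0) == c.find(2));   // different clusters
    }
}
```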

Splitting memory operations through phis should help untangle the 
allocations. So, even if one allocation can't be fully scalarized, it 
shouldn't keep other allocations from being scalarized (unless they alias).
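
At the source level, that untangling can be illustrated with a hedged sketch (hypothetical Point class and sink field, not C2 output): p1 escapes, yet once the field loads are split through the phi, p0 never reaches the merge and could be scalar replaced on its own. Both forms compute the same value.

```java
// Sketch only: p1 escapes, p0 does not. After splitting the loads through
// the phi, p0 has no uses besides its own stores and can go away while p1
// stays. The "sink" field and Point class are hypothetical.
public class UntangleSketch {
    static Object sink;                 // forces p1 to escape
    static class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int merged(boolean cond, int a, int b) {
        Point p0 = new Point(a, b);
        Point p1 = null;
        if (cond) { p1 = new Point(b, a); sink = p1; }
        Point p = cond ? p1 : p0;       // p = phi(p0, p1)
        return p.x - p.y;
    }

    static int split(boolean cond, int a, int b) {
        Point p1 = null;
        if (cond) { p1 = new Point(b, a); sink = p1; }
        int x = cond ? p1.x : a;        // phi of the split loads; p0 is gone
        int y = cond ? p1.y : b;
        return x - y;
    }

    public static void main(String[] args) {
        System.out.println(merged(true, 3, 5) == split(true, 3, 5));
        System.out.println(merged(false, 3, 5) == split(false, 3, 5));
    }
}
```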

I don't have a good understanding of how the new logic will interact 
with unique instance slices & _instance_id. Maybe a gradual transition 
from the original graph to clustered slices and then to unique instance 
slices would work.

Best regards,
Vladimir Ivanov

>>
>> Otherwise, I believe wide memory states/merges would pose serious 
>> problems. E.g., when a load is split through a phi, you need to pick 
>> the correct memory states for the new loads above the phi, which 
>> quickly becomes quite a challenging task.
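
A tiny plain-Java illustration of that difficulty (Holder class hypothetical): the memory state feeding each phi input differs, so each load created by the split has to read the state of its own path rather than the merged one.

```java
// Illustration, not IR: the value of o.x at the merge point depends on the
// path taken, so a load split above the "phi" must use the memory state of
// its own path (M0 or M1), not the merged state phi(M0, M1).
public class MemoryStateSketch {
    static class Holder { int x; }

    static int loadAfterMerge(boolean cond, Holder o) {
        o.x = 1;                 // memory state M0: o.x == 1
        if (cond) {
            o.x = 2;             // memory state M1: o.x == 2
        }
        // This load sees phi(M0, M1); splitting it creates one load per
        // path: the load at M0 yields 1, the load at M1 yields 2.
        return o.x;
    }

    public static void main(String[] args) {
        System.out.println(loadAfterMerge(false, new Holder()));
        System.out.println(loadAfterMerge(true, new Holder()));
    }
}
```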
> 
> Especially in loops. Actually, I did try such an approach before, and 
> I agree that it is hard.
> 
> Thanks,
> Vladimir K
> 
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>
>>> public int ex1(boolean cond, int first, int second) {
>>>      p0 = Allocate(...);
>>>      ...
>>>      p0.x = first;
>>>      p0.y = second;
>>>
>>>      if (cond) {
>>>          p1 = Allocate(...);
>>>          ...
>>>          p1.x = second;
>>>          p1.y = first;
>>>      }
>>>
>>>      p = phi(p0, p1) // unused and will be removed
>>>
>>>
>>>      return phi(p0.x, p1.x) - phi(p0.y, p1.y);
>>> }
>>>
>>> Thanks,
>>> Vladimir K
>>>
>>> On 2/10/22 11:29 AM, Vladimir Ivanov wrote:
>>>> (BCCing hotspot-dev and moving the discussion to hotspot-compiler-dev.)
>>>>
>>>> Hi Cesar,
>>>>
>>>> Thanks for looking into enhancing EA.
>>>>
>>>> Overall, the proposal looks reasonable.
>>>>
>>>> I suggest looking more closely at split_unique_types().
>>>> It introduces a dedicated class of alias categories for the fields of 
>>>> the allocation being eliminated and clones the memory graph. I don't 
>>>> see why it shouldn't work for multiple allocations.
>>>>
>>>> However, split_unique_types() as it stands will break if you start 
>>>> optimizing multiple allocations at once. The notion of a unique alias 
>>>> should be adjusted to cover the union of unique aliases for all 
>>>> interacting allocations.
>>>>
>>>> Seems like you need to enhance SR to work on non-intersecting 
>>>> clusters of allocations.
>>>>
>>>> One thing to take care of: scalar replacement relies on 
>>>> TypeOopPtr::instance_id().
>>>>
>>>>    // If not InstanceTop or InstanceBot, indicates that this is
>>>>    // a particular instance of this type which is distinct.
>>>>    // This is the node index of the allocation node creating this 
>>>> instance.
>>>>    int           _instance_id;
>>>>
>>>> It'll break when multiple allocations are in play.
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>> On 09.02.2022 04:45, Cesar Soares Lucas wrote:
>>>>> Hi there again!
>>>>>
>>>>> Can you please give me feedback on the following approach to at 
>>>>> least partially
>>>>> address [1], the scalar replacement allocation merge issue?
>>>>>
>>>>> The problem that I am trying to solve arises when allocations are 
>>>>> merged after a
>>>>> control flow split. The code below shows _one example_ of such a 
>>>>> situation.
>>>>>
>>>>> public int ex1(boolean cond, int x, int y) {
>>>>>      Point p = new Point(x, y);
>>>>>      if (cond)
>>>>>          p = new Point(y, x);
>>>>>      // Allocations for p are merged here.
>>>>>      return p.calc();
>>>>> }
>>>>>
>>>>> Assuming the method calls on "p" are inlined, the allocations will 
>>>>> not escape the method. The C2 IR for this method will look like this:
>>>>>
>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>      p0 = Allocate(...);
>>>>>      ...
>>>>>      p0.x = first;
>>>>>      p0.y = second;
>>>>>
>>>>>      if (cond) {
>>>>>          p1 = Allocate(...);
>>>>>          ...
>>>>>          p1.x = second;
>>>>>          p1.y = first;
>>>>>      }
>>>>>
>>>>>      p = phi(p0, p1)
>>>>>
>>>>>      return p.x - p.y;
>>>>> }
>>>>>
>>>>> However, one of the constraints implemented here [2], specifically 
>>>>> the third
>>>>> one, will prevent the objects from being scalar replaced.
>>>>>
>>>>> The approach that I'm considering for solving the problem is to 
>>>>> replace the Phi
>>>>> node `p = phi(p0, p1)` with new Phi nodes for each of the fields of 
>>>>> the objects
>>>>> in the original Phi. The IR for `ex1` would look something like 
>>>>> this after the
>>>>> transformation:
>>>>>
>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>      p0 = Allocate(...);
>>>>>      ...
>>>>>      p0.x = first;
>>>>>      p0.y = second;
>>>>>
>>>>>      if (cond) {
>>>>>          p1 = Allocate(...);
>>>>>          ...
>>>>>          p1.x = second;
>>>>>          p1.y = first;
>>>>>      }
>>>>>
>>>>>      pX = phi(first, second)
>>>>>      pY = phi(second, first)
>>>>>
>>>>>      return pX - pY;
>>>>> }
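
As a sanity check, the equivalence of the two forms can be sketched in plain Java (hypothetical Point class; source level, not the actual IR):

```java
// Checks that the rewritten form (per-field phis) computes the same value
// as the original allocation merge for both branch outcomes. The Point
// class here is hypothetical and stands in for the allocations above.
public class Ex1Equivalence {
    static class Point {
        int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    static int original(boolean cond, int first, int second) {
        Point p = new Point(first, second);
        if (cond) p = new Point(second, first);
        return p.x - p.y;               // loads through the merged object
    }

    static int transformed(boolean cond, int first, int second) {
        int pX = cond ? second : first; // pX = phi(first, second)
        int pY = cond ? first : second; // pY = phi(second, first)
        return pX - pY;                 // no object is needed at all
    }

    public static void main(String[] args) {
        for (boolean cond : new boolean[]{false, true})
            System.out.println(original(cond, 3, 5) == transformed(cond, 3, 5));
    }
}
```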
>>>>>
>>>>> I understand that this transformation might not be applicable for 
>>>>> all cases and
>>>>> that it's not as simple as illustrated above. Also, it seems to me 
>>>>> that much of
>>>>> what I'd have to implement is already implemented in other steps of 
>>>>> the Scalar
>>>>> Replacement pipeline (which is a good thing). To work around these
>>>>> implementation details I plan to use as much of the existing code 
>>>>> as possible.
>>>>> The algorithm for the transformation would be like this:
>>>>>
>>>>> split_phis(phi)
>>>>>      # If the output of the phi escapes, or something uses its
>>>>>      # identity, etc., then we can't remove it. The conditions here
>>>>>      # might possibly be the same as the ones implemented in
>>>>>      # `PhaseMacroExpand::can_eliminate_allocation`.
>>>>>      if cant_remove_phi_output(phi)
>>>>>          return
>>>>>
>>>>>      # Collect a set of tuples (F, U) containing the nodes U that use
>>>>>      # field F of the object resulting from `phi`.
>>>>>      fields_used = collect_fields_used_after_phi(phi)
>>>>>
>>>>>      foreach field in fields_used
>>>>>          producers = {}
>>>>>
>>>>>          # Create a list with the last Store to "field" in the scope
>>>>>          # of each of the phi's input objects. The function called
>>>>>          # below might reuse a lot of the code/logic in
>>>>>          # `PhaseMacroExpand::scalar_replacement`.
>>>>>          foreach o in phi.inputs
>>>>>              producers += last_store_to_o_field(o, field)
>>>>>
>>>>>          # Create a new phi node whose inputs are the Stores to `field`.
>>>>>          field_phi = create_new_phi(producers)
>>>>>
>>>>>          update_consumers(field, field_phi)
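
To make the intent concrete, here is a much-simplified, runnable model of the step above (plain Java, all names hypothetical): each phi input is reduced to a map from field name to the value of the last store to that field (what last_store_to_o_field would return), and a per-field "phi" is just the list of those producers in input order.

```java
import java.util.*;

// Toy model of the proposed split_phis step, NOT the C2 implementation.
// Each phi input "object" is a map from field name to the last value
// stored to that field; a field phi is the list of producers, one per
// phi input, in input order.
public class SplitPhisSketch {
    static Map<String, List<Integer>> splitPhis(List<Map<String, Integer>> phiInputs,
                                                Set<String> fieldsUsed) {
        Map<String, List<Integer>> fieldPhis = new HashMap<>();
        for (String field : fieldsUsed) {
            List<Integer> producers = new ArrayList<>();
            for (Map<String, Integer> obj : phiInputs)
                producers.add(obj.get(field));   // last store to obj.field
            fieldPhis.put(field, producers);
        }
        return fieldPhis;
    }

    public static void main(String[] args) {
        // ex1 with first = 3, second = 5: p0 = {x=3, y=5}, p1 = {x=5, y=3}
        Map<String, Integer> p0 = Map.of("x", 3, "y", 5);
        Map<String, Integer> p1 = Map.of("x", 5, "y", 3);
        Map<String, List<Integer>> phis =
                splitPhis(List.of(p0, p1), Set.of("x", "y"));
        // Taking phi input 1 models control flow going through the branch:
        int pX = phis.get("x").get(1);
        int pY = phis.get("y").get(1);
        System.out.println(pX - pY);             // p1.x - p1.y
    }
}
```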
>>>>>
>>>>> The implementation that I envisioned would be as a "pre-process" 
>>>>> [3] step just
>>>>> after EA but before the constraint checks in 
>>>>> `adjust_scalar_replaceable_state`
>>>>> [2]. If we agree that the overall Scalar Replacement implementation 
>>>>> goes through
>>>>> the following major phases:
>>>>>
>>>>>      1. Identify the Escape Status of objects.
>>>>>      2. Adjust object Escape and/or Scalar Replacement status based
>>>>>         on a set of constraints.
>>>>>      3. Call split_unique_types() [4].
>>>>>      4. Iterate over object and array allocations.
>>>>>          4.1 Check if allocation can be eliminated.
>>>>>          4.2 Perform scalar replacement. Replace uses of object in
>>>>>              Safepoints.
>>>>>          4.3 Process users of CheckCastPP other than Safepoint:
>>>>>              AddP, ArrayCopy and CastP2X.
>>>>>
>>>>> The transformation that I am proposing would change the overall 
>>>>> flow to look
>>>>> like this:
>>>>>
>>>>>      1. Identify the Escape Status of objects.
>>>>>      2. ----> New: "Split phi functions" <----
>>>>>      3. Adjust object Escape and/or Scalar Replacement status based
>>>>>         on a set of constraints.
>>>>>      4. Call split_unique_types() [4].
>>>>>      5. Iterate over object and array allocations.
>>>>>          5.1 ----> Moved to split_phis: "Check if allocation can be
>>>>>              eliminated" <----
>>>>>          5.2 Perform scalar replacement. Replace uses of object in
>>>>>              Safepoints.
>>>>>          5.3 Process users of CheckCastPP other than Safepoint:
>>>>>              AddP, ArrayCopy and CastP2X.
>>>>>
>>>>> Please let me know what you think and thank you for taking the time 
>>>>> to review
>>>>> this!
>>>>>
>>>>>
>>>>> Regards,
>>>>> Cesar
>>>>>
>>>>> Notes:
>>>>>
>>>>>      [1] I am not sure yet how this approach will play with the 
>>>>> case of a merge
>>>>>          with NULL.
>>>>>      [2] 
>>>>> https://github.com/openjdk/jdk/blob/2f71a6b39ed6bb869b4eb3e81bc1d87f4b3328ff/src/hotspot/share/opto/escape.cpp#L1809 
>>>>>
>>>>>
>>>>>      [3] Another option would be to "patch" the current implementation
>>>>>          to be able to handle the merges. I am not certain that the
>>>>>          "patch" approach would be better; however, the "pre-process"
>>>>>          approach is certainly much easier to test and more readable.
>>>>>
>>>>>      [4] I cannot say I understand 100% the effects of executing
>>>>>          split_unique_types(). Would the transformation that I am 
>>>>> proposing need to
>>>>>          be after the call to split_unique_types?

