[External] : Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

Thu Feb 17 15:00:45 UTC 2022

Hi Cesar,

> We start of with this:
> 
> static int testPolluted(int r, boolean b) {
>      A obj1 = new A(r);
>      A obj2 = new A(r);
> 
>      A obj = (b ? obj1 : obj2);
>      for (int i = 1; i < r; i++) {
>          obj.i++;
>      }
> 
>      return obj1.i + obj2.i;
> }
> 
> And then we cluster interfering allocations together in disjoint clusters. Two
> questions on this point:
> 
> 1) What exactly grouping the allocations in clusters helps? I believe when you
> say it'll help, I'm just trying to understand what exactly how.

It enables split_unique_types() to introduce dedicated memory slices for 
fields of allocated objects (a pre-requisite for scalarization step). 
When there are memory accesses with indeterminate base, it's illegal to 
introduce memory slices for a particular allocation, because they will 
interfere with existing slices.

But when you process the whole interfering cluster of allocations at 
once, then it's correct to introduce new slices for the whole cluster.
You know all the interactions happen between allocations inside the 
cluster and associate new slices with them.

> 2) To create the allocation clusters is it sufficient to simply put together
> allocations that merge at Phi's (including transitive merges) or we'll need
> something more complex?

Yes, you need to enumerate all possible base values for all accesses 
operating on bases from the cluster (construct a transitive closure).

> 
> After we create the clusters we'll then run a few transformations, including
> splitting memory accesses through Phis. This would transform the example above
> into something like this:

I think the order should be reversed. Do split_unique_types() first to 
get rid of any non-essential interference from unrelated memory 
operations. Then optimize constructed memory graph trying to untangle 
individual allocations from each other.

Best regards,
Vladimir Ivanov

> static int testUnqiue(int r, boolean b) {
>      A obj1 = new A(r);
>      A obj2 = new A(r);
> 
>      for (int i = 1; i < r; i++) {
>          if (b) {
>              obj1.i++;
>          } else {
>              obj2.i++;
>          }
>      }
>      return obj1.i + obj2.i;
> }
> 
> Next we'll run "split_unique_types" and do scalar replacement as we do today and
> eventually we should have something like this:
> 
> static int testUnqiue(int r, boolean b) {
>      int i1 = r;
>      int i2 = r;
> 
>      for (int i = 1; i < r; i++) {
>          if (b) {
>              i1++;
>          } else {
>              i2++;
>          }
>      }
> 
>      return i1 + i2;
> }
> 
> 
> Thanks,
> Cesar
> 
> 
> ________________________________________
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: February 14, 2022 11:39 AM
> To: Vladimir Kozlov; Cesar Soares Lucas; hotspot compiler
> Cc: Brian Stafford; Martijn Verburg
> Subject: Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
> 
> 
>> Once memory graph for unique alias is separated, it should be much
>> simpler to further optimize it. If all memory operations are folded, the
>> allocation can go away.
>>
>> Splitting memory operations through phis should help untangle the
>> allocations. So, even if one allocation can't be fully scalarized, it
>> shouldn't keep other allocations from being scalarized (unless they alias).
>>
>> I don't have a good understanding how new logic will interact with
>> unique instance slices & _instance_id. Maybe a gradual transition from
>> original graph to clustered slices to unique instance slices.
> 
> I took a closer look at a slightly more complicated example:
> 
>       class A {
>           int i;
>           A(int i) { this.i = i; }
>       }
> 
>       static int testPolluted(int r, boolean b) {
>           A obj1 = new A(r);
>           A obj2 = new A(r);
> 
>           A obj = (b ? obj1 : obj2);
>           for (int i = 1; i < r; i++) {
>               obj.i++;
>           }
> 
>           return obj1.i + obj2.i;
>       }
> 
> There are 3 interacting memory slices: obj1.i, obj2.i, and obj.i (which
> alias with obj1.i and obj2.i).
> 
> Unless the access inside the loop is eliminated, there's no way to
> untangle the memory graphs for obj1.i and obj2.i and scalarize allocations.
> 
> In this particular case, it's quite straightforward to untangle the
> graphs for instance types (and make both obj1 and obj2 scalarizable):
> 
>       static int testUnqiue(int r, boolean b) {
>           A obj1 = new A(r);
>           A obj2 = new A(r);
> 
>           for (int i = 1; i < r; i++) {
>               if (b) {
>                   obj1.i++;
>               } else {
>                   obj2.i++;
>               }
>           }
>           return obj1.i + obj2.i;
>       }
> 
> But depending on the actual shape of the base pointer (here it's
> Phi(obj1,obj2)), it may become infeasible (from performance perspective)
> or even impossible (e.g., for complex loop variant conditions) to
> separate instance slices.
> 
> So, gradual approach seems the way to go:
>     (1) split memory graph for clusters of interacting non-escaping
> allocations;
> 
>     (2) perform adhoc transformation targeted at untangling aliasing
> accesses (e.g, split-through-phi);
> 
>     (3) extract unique instance types where possible, thus making the
> corresponding allocation scalar replaceable
> 
> =====================================================================
> 
> Also, one observation: interacting allocations don't have to be of the
> same class.
> 
>       static int testPolluted(int r, boolean b) {
>           A obj1 = new B1(r);
>           A obj2 = new B2(r);
> 
>           A obj = (b ? obj1 : obj2);
>           for (int i = 1; i < r; i++) {
>               obj.i++;
>           }
> 
>           return obj1.i + obj2.i;
>       }
> 
> Requiring unique instance types to perform SR (and not enhancing SR to
> handle aliasing allocations case) helps avoid some challenges in
> instance rematerialization logic at safepoints, because actual shape of
> the scalarized object (its class and exact set of fields) becomes a
> runtime property.
> 
> Best regards,
> Vladimir Ivanov
> 
>>>>
>>>> Otherwise, I believe wide memory states/merges would pose serious
>>>> problems. E.g., when a load is split through a phi, you need to pick
>>>> correct memory states for the new loads above the phi which quickly
>>>> becomes quite a challenging task.
>>>
>>> Especially in loops. Actually I did tried such approach before and I
>>> agree that it is hard.
>>>
>>> Thanks,
>>> Vladimir K
>>>
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>>
>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>       p0 = Allocate(...);
>>>>>       ...
>>>>>       p0.x = first;
>>>>>       p0.y = second;
>>>>>
>>>>>       if (cond) {
>>>>>           p1 = Allocate(...);
>>>>>           ...
>>>>>           p1.x = second;
>>>>>           p1.y = first;
>>>>>       }
>>>>>
>>>>>       p = phi(p0, p1) // unused and will be removed
>>>>>
>>>>>
>>>>>       return phi(p0.x,p1.x) - phi(p0.y, p1.y);
>>>>> }
>>>>>
>>>>> Thanks,
>>>>> Vladimir K
>>>>>
>>>>> On 2/10/22 11:29 AM, Vladimir Ivanov wrote:
>>>>>> (BCCing hotspot-dev and moving the discussion to
>>>>>> hotspot-compiler-dev.)
>>>>>>
>>>>>> Hi Cesar,
>>>>>>
>>>>>> Thanks for looking into enhancing EA.
>>>>>>
>>>>>> Overall, the proposal looks reasonable.
>>>>>>
>>>>>> I suggest to look more closely at split_unique_types().
>>>>>> It introduces a dedicated class of alias categories for fields of
>>>>>> the allocation being eliminated and clones memory graph. I don't
>>>>>> see why it shouldn't work for multiple allocations.
>>>>>>
>>>>>> Moreover, split_unique_types() will break if you start optimizing
>>>>>> multiple allocations at once. The notion of unique alias should be
>>>>>> adjusted and cover the union of unique aliases for all interacting
>>>>>> allocations.
>>>>>>
>>>>>> Seems like you need to enhance SR to work on non-intersecting
>>>>>> clusters of allocations.
>>>>>>
>>>>>> One thing to take care of: scalar replacement relies on
>>>>>> TypeOopPtr::instance_id().
>>>>>>
>>>>>>     // If not InstanceTop or InstanceBot, indicates that this is
>>>>>>     // a particular instance of this type which is distinct.
>>>>>>     // This is the node index of the allocation node creating this
>>>>>> instance.
>>>>>>     int           _instance_id;
>>>>>>
>>>>>> It'll break when multiple allocations are in play.
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> On 09.02.2022 04:45, Cesar Soares Lucas wrote:
>>>>>>> Hi there again!
>>>>>>>
>>>>>>> Can you please give me feedback on the following approach to at
>>>>>>> least partially
>>>>>>> address [1], the scalar replacement allocation merge issue?
>>>>>>>
>>>>>>> The problem that I am trying to solve arises when allocations are
>>>>>>> merged after a
>>>>>>> control flow split. The code below shows _one example_ of such a
>>>>>>> situation.
>>>>>>>
>>>>>>> public int ex1(boolean cond, int x, int y) {
>>>>>>>       Point p = new Point(x, y);
>>>>>>>       if (cond)
>>>>>>>           p = new Point(y, x);
>>>>>>>       // Allocations for p are merged here.
>>>>>>>       return p.calc();
>>>>>>> }
>>>>>>>
>>>>>>> Assuming the method calls on "p" are inlined then the allocations
>>>>>>> will not
>>>>>>> escape the method. The C2 IR for this method will look like this:
>>>>>>>
>>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>>       p0 = Allocate(...);
>>>>>>>       ...
>>>>>>>       p0.x = first;
>>>>>>>       p0.y = second;
>>>>>>>
>>>>>>>       if (cond) {
>>>>>>>           p1 = Allocate(...);
>>>>>>>           ...
>>>>>>>           p1.x = second;
>>>>>>>           p1.y = first;
>>>>>>>       }
>>>>>>>
>>>>>>>       p = phi(p0, p1)
>>>>>>>
>>>>>>>       return p.x - p.y;
>>>>>>> }
>>>>>>>
>>>>>>> However, one of the constraints implemented here [2], specifically
>>>>>>> the third
>>>>>>> one, will prevent the objects from being scalar replaced.
>>>>>>>
>>>>>>> The approach that I'm considering for solving the problem is to
>>>>>>> replace the Phi
>>>>>>> node `p = phi(p0, p1)` with new Phi nodes for each of the fields
>>>>>>> of the objects
>>>>>>> in the original Phi. The IR for `ex1` would look something like
>>>>>>> this after the
>>>>>>> transformation:
>>>>>>>
>>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>>       p0 = Allocate(...);
>>>>>>>       ...
>>>>>>>       p0.x = first;
>>>>>>>       p0.y = second;
>>>>>>>
>>>>>>>       if (cond) {
>>>>>>>           p1 = Allocate(...);
>>>>>>>           ...
>>>>>>>           p1.x = second;
>>>>>>>           p1.y = first;
>>>>>>>       }
>>>>>>>
>>>>>>>       pX = phi(first, second)
>>>>>>>       pY = phi(second, first)
>>>>>>>
>>>>>>>       return pX - pY;
>>>>>>> }
>>>>>>>
>>>>>>> I understand that this transformation might not be applicable for
>>>>>>> all cases and
>>>>>>> that it's not as simple as illustrated above. Also, it seems to me
>>>>>>> that much of
>>>>>>> what I'd have to implement is already implemented in other steps
>>>>>>> of the Scalar
>>>>>>> Replacement pipeline (which is a good thing). To work around these
>>>>>>> implementation details I plan to use as much of the existing code
>>>>>>> as possible.
>>>>>>> The algorithm for the transformation would be like this:
>>>>>>>
>>>>>>> split_phis(phi)
>>>>>>>       # If output of phi escapes, or something uses its identity, etc
>>>>>>>       # then we can't remove it. The conditions here might possible
>>>>>>> be the
>>>>>>>       # same as the ones implemented in
>>>>>>> `PhaseMacroExpand::can_eliminate_allocation`
>>>>>>>       if cant_remove_phi_output(phi)
>>>>>>>           return ;
>>>>>>>
>>>>>>>       # Collect a set of tuples(F,U) containing nodes U that uses
>>>>>>> field F
>>>>>>>       # member of the object resulting from `phi`.
>>>>>>>       fields_used = collect_fields_used_after_phi(phi)
>>>>>>>
>>>>>>>       foreach field in fields_used
>>>>>>>           producers = {}
>>>>>>>
>>>>>>>           # Create a list with the last Store for each field
>>>>>>> "field" on the
>>>>>>>           # scope of each of the Phi input objects.
>>>>>>>           foreach o in phi.inputs
>>>>>>>               # The function called below might re-use a lot of the
>>>>>>> code/logic in `PhaseMacroExpand::scalar_replacement`
>>>>>>>               producers += last_store_to_o_field(0, field)
>>>>>>>           # Create a new phi node whose inputs are the Store's to
>>>>>>> 'field'
>>>>>>>           field_phi = create_new_phi(producers)
>>>>>>>
>>>>>>>           update_consumers(field, field_phi)
>>>>>>>
>>>>>>> The implementation that I envisioned would be as a "pre-process"
>>>>>>> [3] step just
>>>>>>> after EA but before the constraint checks in
>>>>>>> `adjust_scalar_replaceable_state`
>>>>>>> [2]. If we agree that the overall Scalar Replacement
>>>>>>> implementation goes through
>>>>>>> the following major phases:
>>>>>>>
>>>>>>>       1. Identify the Escape Status of objects.
>>>>>>>       2. Adjust object Escape and/or Scalar Replacement status
>>>>>>> based on a set of constraints.
>>>>>>>       3. Make call to Split_unique_types [4].
>>>>>>>       4 Iterate over object and array allocations.
>>>>>>>           4.1 Check if allocation can be eliminated.
>>>>>>>           4.2 Perform scalar replacement. Replace uses of object in
>>>>>>> Safepoints.
>>>>>>>           4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>> AddP, ArrayCopy and CastP2X.
>>>>>>>
>>>>>>> The transformation that I am proposing would change the overall
>>>>>>> flow to look
>>>>>>> like this:
>>>>>>>
>>>>>>>       1. Identify the Escape Status of objects.
>>>>>>>       2. ----> New: "Split phi functions" <----
>>>>>>>       2. Adjust object Escape and/or Scalar Replacement status
>>>>>>> based on a set of constraints.
>>>>>>>       3. Make call to Split_unique_types [14].
>>>>>>>       4 Iterate over object and array allocations.
>>>>>>>           4.1 ----> Moved to split_phi: "Check if allocation can be
>>>>>>> eliminated" <----
>>>>>>>           4.2 Perform scalar replacement. Replace uses of object in
>>>>>>> Safepoints.
>>>>>>>           4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>> AddP, ArrayCopy and CastP2X.
>>>>>>>
>>>>>>> Please let me know what you think and thank you for taking the
>>>>>>> time to review
>>>>>>> this!
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Cesar
>>>>>>>
>>>>>>> Notes:
>>>>>>>
>>>>>>>       [1] I am not sure yet how this approach will play with the
>>>>>>> case of a merge
>>>>>>>           with NULL.
>>>>>>>       [2]
>>>>>>> https://urldefense.com/v3/__https://nam06.safelinks.protection.outlook.com/?url=https*3A*2F*2Fgithub.com*2Fopenjdk*2Fjdk*2Fblob*2F2f71a6b39ed6bb869b4eb3e81bc1d87f4b3328ff*2Fsrc*2Fhotspot*2Fshare*2Fopto*2Fescape.cpp*23L1809&data=04*7C01*7CDivino.Cesar*40microsoft.com*7Cf6f03bf26b724150d0e508d9eff1d0bf*7C72f988bf86f141af91ab2d7cd011db47*7C1*7C0*7C637804644159660386*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000&sdata=j5KJXWywNe6eNmgD8vDXyhJTPhugsUFvivvocbiwnrE*3D&reserved=0__;JSUlJSUlJSUlJSUlJSUlJSUlJSUlJSUlJSU!!ACWV5N9M2RV99hQ!dCjbOcNY-TmA_XDaROQVmVTuzAFxM73migoYdKW9Qle3xgFD7PANME_5LXeGQ0ldiqERUSk$
>>>>>>>
>>>>>>>
>>>>>>>       [3] Another option would be to "patch" the current
>>>>>>> implementation to be able
>>>>>>>           to handle the merges. I am not certain that the "patch"
>>>>>>> approach would be
>>>>>>>           better, however, the "pre-process" approach is certainly
>>>>>>> much easier to test
>>>>>>>           and more readable.
>>>>>>>
>>>>>>>       [4] I cannot say I understand 100% the effects of executing
>>>>>>>           split_unique_types(). Would the transformation that I am
>>>>>>> proposing need to
>>>>>>>           be after the call to split_unique_types?