RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

Fri Feb 18 17:53:18 UTC 2022

Hi Vladimir, thank you for sharing this idea.

I with the selector-based base splitting idea using some examples and I think I understand the mechanics well. It's conceptually simple and quite effective, I like it! I've another question, though. From your previous message I had understood that we needed the clustering because split_unique_types can't handle multiple bases at once, which makes sense to me. _This time I don't understand why we need to do the clustering before doing the selector-based transformation_. AFAIU this transformation won't be creating any memory slices but rather just assigning IDs to different bases and then creating specialized memory operations conditionally using the different bases.

Regards,
Cesar

________________________________________
From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Sent: February 17, 2022 7:37 AM
To: Vladimir Kozlov; Cesar Soares Lucas; hotspot compiler
Cc: Brian Stafford; Martijn Verburg
Subject: Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

> But depending on the actual shape of the base pointer (here it's
> Phi(obj1,obj2)), it may become infeasible (from performance perspective)
> or even impossible (e.g., for complex loop variant conditions) to
> separate instance slices.

I'd like to correct myself. After thinking more about the problem, I see
a universal way to scalarize clusters of interfering non-escaping
allocations even in presence of accesses with indeterminate base.

The crucial property for such cluster of interacting non-escaping
allocations is for any memory access its base is either part of the
cluster or not (it's a static property for every access). Then, when it
is part of the cluster, then the number of possible base values at
runtime is finite and is a subset of the cluster.

So, memory graph for the cluster can be scalarized as follows:

   (1) separate memory graph for the cluster

   (2) on the newly constructed graph:

      (a) replace every base pointer with an ID ("selector id") and
recreate data graph for selector IDs from base pointers graph;

      (b) for accesses with indeterminate base pointer, replace them
with a merge of states from relevant allocations selected by "selector id"

      (c) after 2b, all memory accesses from the cluster memory graph
should have fixed base pointing at a particular allocation

   (3) scalarize individual allocations from the cluster one by one
(since they are independent now)

      - additional transformation (like, split-through-phi) should
simplify the graph by reducing the number of cases at merge points to
care about (ideally, leaving only a single one);

As an illustration:

   val = Load (AddP (Phi R base1 base2) offset) mem # load A.i

==(pseudo-code)==>

   selector = (Phi R #base1 #base2)

   if (selector == #base1) {
     val1 = Load (base1 + offset) mem
   } else if (selector == #base2) {
     val2 = Load (base2 + offset) mem
   } else {
     halt(); // impossible
   }

   val = Phi(R1, val1, val2)

======

   Store (AddP (Phi R base1 base2) offset) mem v # store A.i v

==(pseudo-code)==>

   selector = (Phi R #base1 #base2)

   if (selector == #base1) {
     Store base1 mem v ==> mem1
   } else if (selector == #base2) {
     Store base2 mem v ==> mem2
   } else {
     halt(); // impossible
   }

   mem = Phi(R1, mem1, mem2);

Best regards,
Vladimir Ivanov

>
> So, gradual approach seems the way to go:
>    (1) split memory graph for clusters of interacting non-escaping
> allocations;
>
>    (2) perform adhoc transformation targeted at untangling aliasing
> accesses (e.g, split-through-phi);
>
>    (3) extract unique instance types where possible, thus making the
> corresponding allocation scalar replaceable
>
> =====================================================================
>
> Also, one observation: interacting allocations don't have to be of the
> same class.
>
>      static int testPolluted(int r, boolean b) {
>          A obj1 = new B1(r);
>          A obj2 = new B2(r);
>
>          A obj = (b ? obj1 : obj2);
>          for (int i = 1; i < r; i++) {
>              obj.i++;
>          }
>
>          return obj1.i + obj2.i;
>      }
>
> Requiring unique instance types to perform SR (and not enhancing SR to
> handle aliasing allocations case) helps avoid some challenges in
> instance rematerialization logic at safepoints, because actual shape of
> the scalarized object (its class and exact set of fields) becomes a
> runtime property.
>
> Best regards,
> Vladimir Ivanov
>
>>>>
>>>> Otherwise, I believe wide memory states/merges would pose serious
>>>> problems. E.g., when a load is split through a phi, you need to pick
>>>> correct memory states for the new loads above the phi which quickly
>>>> becomes quite a challenging task.
>>>
>>> Especially in loops. Actually I did tried such approach before and I
>>> agree that it is hard.
>>>
>>> Thanks,
>>> Vladimir K
>>>
>>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>>
>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>      p0 = Allocate(...);
>>>>>      ...
>>>>>      p0.x = first;
>>>>>      p0.y = second;
>>>>>
>>>>>      if (cond) {
>>>>>          p1 = Allocate(...);
>>>>>          ...
>>>>>          p1.x = second;
>>>>>          p1.y = first;
>>>>>      }
>>>>>
>>>>>      p = phi(p0, p1) // unused and will be removed
>>>>>
>>>>>
>>>>>      return phi(p0.x,p1.x) - phi(p0.y, p1.y);
>>>>> }
>>>>>
>>>>> Thanks,
>>>>> Vladimir K
>>>>>
>>>>> On 2/10/22 11:29 AM, Vladimir Ivanov wrote:
>>>>>> (BCCing hotspot-dev and moving the discussion to
>>>>>> hotspot-compiler-dev.)
>>>>>>
>>>>>> Hi Cesar,
>>>>>>
>>>>>> Thanks for looking into enhancing EA.
>>>>>>
>>>>>> Overall, the proposal looks reasonable.
>>>>>>
>>>>>> I suggest to look more closely at split_unique_types().
>>>>>> It introduces a dedicated class of alias categories for fields of
>>>>>> the allocation being eliminated and clones memory graph. I don't
>>>>>> see why it shouldn't work for multiple allocations.
>>>>>>
>>>>>> Moreover, split_unique_types() will break if you start optimizing
>>>>>> multiple allocations at once. The notion of unique alias should be
>>>>>> adjusted and cover the union of unique aliases for all interacting
>>>>>> allocations.
>>>>>>
>>>>>> Seems like you need to enhance SR to work on non-intersecting
>>>>>> clusters of allocations.
>>>>>>
>>>>>> One thing to take care of: scalar replacement relies on
>>>>>> TypeOopPtr::instance_id().
>>>>>>
>>>>>>    // If not InstanceTop or InstanceBot, indicates that this is
>>>>>>    // a particular instance of this type which is distinct.
>>>>>>    // This is the node index of the allocation node creating this
>>>>>> instance.
>>>>>>    int           _instance_id;
>>>>>>
>>>>>> It'll break when multiple allocations are in play.
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> On 09.02.2022 04:45, Cesar Soares Lucas wrote:
>>>>>>> Hi there again!
>>>>>>>
>>>>>>> Can you please give me feedback on the following approach to at
>>>>>>> least partially
>>>>>>> address [1], the scalar replacement allocation merge issue?
>>>>>>>
>>>>>>> The problem that I am trying to solve arises when allocations are
>>>>>>> merged after a
>>>>>>> control flow split. The code below shows _one example_ of such a
>>>>>>> situation.
>>>>>>>
>>>>>>> public int ex1(boolean cond, int x, int y) {
>>>>>>>      Point p = new Point(x, y);
>>>>>>>      if (cond)
>>>>>>>          p = new Point(y, x);
>>>>>>>      // Allocations for p are merged here.
>>>>>>>      return p.calc();
>>>>>>> }
>>>>>>>
>>>>>>> Assuming the method calls on "p" are inlined then the allocations
>>>>>>> will not
>>>>>>> escape the method. The C2 IR for this method will look like this:
>>>>>>>
>>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>>      p0 = Allocate(...);
>>>>>>>      ...
>>>>>>>      p0.x = first;
>>>>>>>      p0.y = second;
>>>>>>>
>>>>>>>      if (cond) {
>>>>>>>          p1 = Allocate(...);
>>>>>>>          ...
>>>>>>>          p1.x = second;
>>>>>>>          p1.y = first;
>>>>>>>      }
>>>>>>>
>>>>>>>      p = phi(p0, p1)
>>>>>>>
>>>>>>>      return p.x - p.y;
>>>>>>> }
>>>>>>>
>>>>>>> However, one of the constraints implemented here [2],
>>>>>>> specifically the third
>>>>>>> one, will prevent the objects from being scalar replaced.
>>>>>>>
>>>>>>> The approach that I'm considering for solving the problem is to
>>>>>>> replace the Phi
>>>>>>> node `p = phi(p0, p1)` with new Phi nodes for each of the fields
>>>>>>> of the objects
>>>>>>> in the original Phi. The IR for `ex1` would look something like
>>>>>>> this after the
>>>>>>> transformation:
>>>>>>>
>>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>>      p0 = Allocate(...);
>>>>>>>      ...
>>>>>>>      p0.x = first;
>>>>>>>      p0.y = second;
>>>>>>>
>>>>>>>      if (cond) {
>>>>>>>          p1 = Allocate(...);
>>>>>>>          ...
>>>>>>>          p1.x = second;
>>>>>>>          p1.y = first;
>>>>>>>      }
>>>>>>>
>>>>>>>      pX = phi(first, second)
>>>>>>>      pY = phi(second, first)
>>>>>>>
>>>>>>>      return pX - pY;
>>>>>>> }
>>>>>>>
>>>>>>> I understand that this transformation might not be applicable for
>>>>>>> all cases and
>>>>>>> that it's not as simple as illustrated above. Also, it seems to
>>>>>>> me that much of
>>>>>>> what I'd have to implement is already implemented in other steps
>>>>>>> of the Scalar
>>>>>>> Replacement pipeline (which is a good thing). To work around these
>>>>>>> implementation details I plan to use as much of the existing code
>>>>>>> as possible.
>>>>>>> The algorithm for the transformation would be like this:
>>>>>>>
>>>>>>> split_phis(phi)
>>>>>>>      # If output of phi escapes, or something uses its identity, etc
>>>>>>>      # then we can't remove it. The conditions here might
>>>>>>> possible be the
>>>>>>>      # same as the ones implemented in
>>>>>>> `PhaseMacroExpand::can_eliminate_allocation`
>>>>>>>      if cant_remove_phi_output(phi)
>>>>>>>          return ;
>>>>>>>
>>>>>>>      # Collect a set of tuples(F,U) containing nodes U that uses
>>>>>>> field F
>>>>>>>      # member of the object resulting from `phi`.
>>>>>>>      fields_used = collect_fields_used_after_phi(phi)
>>>>>>>
>>>>>>>      foreach field in fields_used
>>>>>>>          producers = {}
>>>>>>>
>>>>>>>          # Create a list with the last Store for each field
>>>>>>> "field" on the
>>>>>>>          # scope of each of the Phi input objects.
>>>>>>>          foreach o in phi.inputs
>>>>>>>              # The function called below might re-use a lot of
>>>>>>> the code/logic in `PhaseMacroExpand::scalar_replacement`
>>>>>>>              producers += last_store_to_o_field(0, field)
>>>>>>>          # Create a new phi node whose inputs are the Store's to
>>>>>>> 'field'
>>>>>>>          field_phi = create_new_phi(producers)
>>>>>>>
>>>>>>>          update_consumers(field, field_phi)
>>>>>>>
>>>>>>> The implementation that I envisioned would be as a "pre-process"
>>>>>>> [3] step just
>>>>>>> after EA but before the constraint checks in
>>>>>>> `adjust_scalar_replaceable_state`
>>>>>>> [2]. If we agree that the overall Scalar Replacement
>>>>>>> implementation goes through
>>>>>>> the following major phases:
>>>>>>>
>>>>>>>      1. Identify the Escape Status of objects.
>>>>>>>      2. Adjust object Escape and/or Scalar Replacement status
>>>>>>> based on a set of constraints.
>>>>>>>      3. Make call to Split_unique_types [4].
>>>>>>>      4 Iterate over object and array allocations.
>>>>>>>          4.1 Check if allocation can be eliminated.
>>>>>>>          4.2 Perform scalar replacement. Replace uses of object
>>>>>>> in Safepoints.
>>>>>>>          4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>> AddP, ArrayCopy and CastP2X.
>>>>>>>
>>>>>>> The transformation that I am proposing would change the overall
>>>>>>> flow to look
>>>>>>> like this:
>>>>>>>
>>>>>>>      1. Identify the Escape Status of objects.
>>>>>>>      2. ----> New: "Split phi functions" <----
>>>>>>>      2. Adjust object Escape and/or Scalar Replacement status
>>>>>>> based on a set of constraints.
>>>>>>>      3. Make call to Split_unique_types [14].
>>>>>>>      4 Iterate over object and array allocations.
>>>>>>>          4.1 ----> Moved to split_phi: "Check if allocation can
>>>>>>> be eliminated" <----
>>>>>>>          4.2 Perform scalar replacement. Replace uses of object
>>>>>>> in Safepoints.
>>>>>>>          4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>> AddP, ArrayCopy and CastP2X.
>>>>>>>
>>>>>>> Please let me know what you think and thank you for taking the
>>>>>>> time to review
>>>>>>> this!
>>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>> Cesar
>>>>>>>
>>>>>>> Notes:
>>>>>>>
>>>>>>>      [1] I am not sure yet how this approach will play with the
>>>>>>> case of a merge
>>>>>>>          with NULL.
>>>>>>>      [2]
>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopenjdk%2Fjdk%2Fblob%2F2f71a6b39ed6bb869b4eb3e81bc1d87f4b3328ff%2Fsrc%2Fhotspot%2Fshare%2Fopto%2Fescape.cpp%23L1809&data=04%7C01%7CDivino.Cesar%40microsoft.com%7Ce13b0a3de30e43d7ac4408d9f22b75c5%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637807090747649306%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=WruWqIgGVIzIFIG0gD0%2FEIQm40euLx6FZMexotJVrlE%3D&reserved=0
>>>>>>>
>>>>>>>
>>>>>>>      [3] Another option would be to "patch" the current
>>>>>>> implementation to be able
>>>>>>>          to handle the merges. I am not certain that the "patch"
>>>>>>> approach would be
>>>>>>>          better, however, the "pre-process" approach is certainly
>>>>>>> much easier to test
>>>>>>>          and more readable.
>>>>>>>
>>>>>>>      [4] I cannot say I understand 100% the effects of executing
>>>>>>>          split_unique_types(). Would the transformation that I am
>>>>>>> proposing need to
>>>>>>>          be after the call to split_unique_types?