[External] : Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

Tue Feb 22 17:59:51 UTC 2022

Hi Vladimir,

I think I've enough feedback to start working on a prototype and see how things play out. Thank you for sharing your ideas/expertise!

Cheers,
Cesar

________________________________________
From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Sent: February 21, 2022 10:26 AM
To: Cesar Soares Lucas; Vladimir Kozlov; hotspot compiler
Cc: Brian Stafford; Martijn Verburg
Subject: Re: [External] : Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

Hi Cesar,

> I with the selector-based base splitting idea using some examples and I think I understand the mechanics well. It's conceptually simple and quite effective, I like it! I've another question, though. From your previous message I had understood that we needed the clustering because split_unique_types can't handle multiple bases at once, which makes sense to me. _This time I don't understand why we need to do the clustering before doing the selector-based transformation_. AFAIU this transformation won't be creating any memory slices but rather just assigning IDs to different bases and then creating specialized memory operations conditionally using the different bases.

Yes, that's a valid simplification. It does sound like introducing
selector ID doesn't require a separate memory graph (and can be
performed before split_unique_types()), but I don't have a good
understanding how complicated simplifying selector-based IR shapes would
be compared to original (memory op-based) ones. It's definitely worth
experimenting with.

Best regards,
Vladimir Ivanov

> ________________________________________
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: February 17, 2022 7:37 AM
> To: Vladimir Kozlov; Cesar Soares Lucas; hotspot compiler
> Cc: Brian Stafford; Martijn Verburg
> Subject: Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
>
>
>> But depending on the actual shape of the base pointer (here it's
>> Phi(obj1,obj2)), it may become infeasible (from performance perspective)
>> or even impossible (e.g., for complex loop variant conditions) to
>> separate instance slices.
>
> I'd like to correct myself. After thinking more about the problem, I see
> a universal way to scalarize clusters of interfering non-escaping
> allocations even in presence of accesses with indeterminate base.
>
> The crucial property for such cluster of interacting non-escaping
> allocations is for any memory access its base is either part of the
> cluster or not (it's a static property for every access). Then, when it
> is part of the cluster, then the number of possible base values at
> runtime is finite and is a subset of the cluster.
>
> So, memory graph for the cluster can be scalarized as follows:
>
>     (1) separate memory graph for the cluster
>
>     (2) on the newly constructed graph:
>
>        (a) replace every base pointer with an ID ("selector id") and
> recreate data graph for selector IDs from base pointers graph;
>
>        (b) for accesses with indeterminate base pointer, replace them
> with a merge of states from relevant allocations selected by "selector id"
>
>        (c) after 2b, all memory accesses from the cluster memory graph
> should have fixed base pointing at a particular allocation
>
>     (3) scalarize individual allocations from the cluster one by one
> (since they are independent now)
>
>        - additional transformation (like, split-through-phi) should
> simplify the graph by reducing the number of cases at merge points to
> care about (ideally, leaving only a single one);
>
>
>
> As an illustration:
>
>     val = Load (AddP (Phi R base1 base2) offset) mem # load A.i
>
> ==(pseudo-code)==>
>
>     selector = (Phi R #base1 #base2)
>
>     if (selector == #base1) {
>       val1 = Load (base1 + offset) mem
>     } else if (selector == #base2) {
>       val2 = Load (base2 + offset) mem
>     } else {
>       halt(); // impossible
>     }
>
>     val = Phi(R1, val1, val2)
>
> ======
>
>     Store (AddP (Phi R base1 base2) offset) mem v # store A.i v
>
> ==(pseudo-code)==>
>
>     selector = (Phi R #base1 #base2)
>
>     if (selector == #base1) {
>       Store base1 mem v ==> mem1
>     } else if (selector == #base2) {
>       Store base2 mem v ==> mem2
>     } else {
>       halt(); // impossible
>     }
>
>     mem = Phi(R1, mem1, mem2);
>
>
> Best regards,
> Vladimir Ivanov
>
>>
>> So, gradual approach seems the way to go:
>>     (1) split memory graph for clusters of interacting non-escaping
>> allocations;
>>
>>     (2) perform adhoc transformation targeted at untangling aliasing
>> accesses (e.g, split-through-phi);
>>
>>     (3) extract unique instance types where possible, thus making the
>> corresponding allocation scalar replaceable
>>
>> =====================================================================
>>
>> Also, one observation: interacting allocations don't have to be of the
>> same class.
>>
>>       static int testPolluted(int r, boolean b) {
>>           A obj1 = new B1(r);
>>           A obj2 = new B2(r);
>>
>>           A obj = (b ? obj1 : obj2);
>>           for (int i = 1; i < r; i++) {
>>               obj.i++;
>>           }
>>
>>           return obj1.i + obj2.i;
>>       }
>>
>> Requiring unique instance types to perform SR (and not enhancing SR to
>> handle aliasing allocations case) helps avoid some challenges in
>> instance rematerialization logic at safepoints, because actual shape of
>> the scalarized object (its class and exact set of fields) becomes a
>> runtime property.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>>>>>
>>>>> Otherwise, I believe wide memory states/merges would pose serious
>>>>> problems. E.g., when a load is split through a phi, you need to pick
>>>>> correct memory states for the new loads above the phi which quickly
>>>>> becomes quite a challenging task.
>>>>
>>>> Especially in loops. Actually I did tried such approach before and I
>>>> agree that it is hard.
>>>>
>>>> Thanks,
>>>> Vladimir K
>>>>
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>>
>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>       p0 = Allocate(...);
>>>>>>       ...
>>>>>>       p0.x = first;
>>>>>>       p0.y = second;
>>>>>>
>>>>>>       if (cond) {
>>>>>>           p1 = Allocate(...);
>>>>>>           ...
>>>>>>           p1.x = second;
>>>>>>           p1.y = first;
>>>>>>       }
>>>>>>
>>>>>>       p = phi(p0, p1) // unused and will be removed
>>>>>>
>>>>>>
>>>>>>       return phi(p0.x,p1.x) - phi(p0.y, p1.y);
>>>>>> }
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir K
>>>>>>
>>>>>> On 2/10/22 11:29 AM, Vladimir Ivanov wrote:
>>>>>>> (BCCing hotspot-dev and moving the discussion to
>>>>>>> hotspot-compiler-dev.)
>>>>>>>
>>>>>>> Hi Cesar,
>>>>>>>
>>>>>>> Thanks for looking into enhancing EA.
>>>>>>>
>>>>>>> Overall, the proposal looks reasonable.
>>>>>>>
>>>>>>> I suggest to look more closely at split_unique_types().
>>>>>>> It introduces a dedicated class of alias categories for fields of
>>>>>>> the allocation being eliminated and clones memory graph. I don't
>>>>>>> see why it shouldn't work for multiple allocations.
>>>>>>>
>>>>>>> Moreover, split_unique_types() will break if you start optimizing
>>>>>>> multiple allocations at once. The notion of unique alias should be
>>>>>>> adjusted and cover the union of unique aliases for all interacting
>>>>>>> allocations.
>>>>>>>
>>>>>>> Seems like you need to enhance SR to work on non-intersecting
>>>>>>> clusters of allocations.
>>>>>>>
>>>>>>> One thing to take care of: scalar replacement relies on
>>>>>>> TypeOopPtr::instance_id().
>>>>>>>
>>>>>>>     // If not InstanceTop or InstanceBot, indicates that this is
>>>>>>>     // a particular instance of this type which is distinct.
>>>>>>>     // This is the node index of the allocation node creating this
>>>>>>> instance.
>>>>>>>     int           _instance_id;
>>>>>>>
>>>>>>> It'll break when multiple allocations are in play.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vladimir Ivanov
>>>>>>>
>>>>>>> On 09.02.2022 04:45, Cesar Soares Lucas wrote:
>>>>>>>> Hi there again!
>>>>>>>>
>>>>>>>> Can you please give me feedback on the following approach to at
>>>>>>>> least partially
>>>>>>>> address [1], the scalar replacement allocation merge issue?
>>>>>>>>
>>>>>>>> The problem that I am trying to solve arises when allocations are
>>>>>>>> merged after a
>>>>>>>> control flow split. The code below shows _one example_ of such a
>>>>>>>> situation.
>>>>>>>>
>>>>>>>> public int ex1(boolean cond, int x, int y) {
>>>>>>>>       Point p = new Point(x, y);
>>>>>>>>       if (cond)
>>>>>>>>           p = new Point(y, x);
>>>>>>>>       // Allocations for p are merged here.
>>>>>>>>       return p.calc();
>>>>>>>> }
>>>>>>>>
>>>>>>>> Assuming the method calls on "p" are inlined then the allocations
>>>>>>>> will not
>>>>>>>> escape the method. The C2 IR for this method will look like this:
>>>>>>>>
>>>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>>>       p0 = Allocate(...);
>>>>>>>>       ...
>>>>>>>>       p0.x = first;
>>>>>>>>       p0.y = second;
>>>>>>>>
>>>>>>>>       if (cond) {
>>>>>>>>           p1 = Allocate(...);
>>>>>>>>           ...
>>>>>>>>           p1.x = second;
>>>>>>>>           p1.y = first;
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       p = phi(p0, p1)
>>>>>>>>
>>>>>>>>       return p.x - p.y;
>>>>>>>> }
>>>>>>>>
>>>>>>>> However, one of the constraints implemented here [2],
>>>>>>>> specifically the third
>>>>>>>> one, will prevent the objects from being scalar replaced.
>>>>>>>>
>>>>>>>> The approach that I'm considering for solving the problem is to
>>>>>>>> replace the Phi
>>>>>>>> node `p = phi(p0, p1)` with new Phi nodes for each of the fields
>>>>>>>> of the objects
>>>>>>>> in the original Phi. The IR for `ex1` would look something like
>>>>>>>> this after the
>>>>>>>> transformation:
>>>>>>>>
>>>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>>>       p0 = Allocate(...);
>>>>>>>>       ...
>>>>>>>>       p0.x = first;
>>>>>>>>       p0.y = second;
>>>>>>>>
>>>>>>>>       if (cond) {
>>>>>>>>           p1 = Allocate(...);
>>>>>>>>           ...
>>>>>>>>           p1.x = second;
>>>>>>>>           p1.y = first;
>>>>>>>>       }
>>>>>>>>
>>>>>>>>       pX = phi(first, second)
>>>>>>>>       pY = phi(second, first)
>>>>>>>>
>>>>>>>>       return pX - pY;
>>>>>>>> }
>>>>>>>>
>>>>>>>> I understand that this transformation might not be applicable for
>>>>>>>> all cases and
>>>>>>>> that it's not as simple as illustrated above. Also, it seems to
>>>>>>>> me that much of
>>>>>>>> what I'd have to implement is already implemented in other steps
>>>>>>>> of the Scalar
>>>>>>>> Replacement pipeline (which is a good thing). To work around these
>>>>>>>> implementation details I plan to use as much of the existing code
>>>>>>>> as possible.
>>>>>>>> The algorithm for the transformation would be like this:
>>>>>>>>
>>>>>>>> split_phis(phi)
>>>>>>>>       # If output of phi escapes, or something uses its identity, etc
>>>>>>>>       # then we can't remove it. The conditions here might
>>>>>>>> possible be the
>>>>>>>>       # same as the ones implemented in
>>>>>>>> `PhaseMacroExpand::can_eliminate_allocation`
>>>>>>>>       if cant_remove_phi_output(phi)
>>>>>>>>           return ;
>>>>>>>>
>>>>>>>>       # Collect a set of tuples(F,U) containing nodes U that uses
>>>>>>>> field F
>>>>>>>>       # member of the object resulting from `phi`.
>>>>>>>>       fields_used = collect_fields_used_after_phi(phi)
>>>>>>>>
>>>>>>>>       foreach field in fields_used
>>>>>>>>           producers = {}
>>>>>>>>
>>>>>>>>           # Create a list with the last Store for each field
>>>>>>>> "field" on the
>>>>>>>>           # scope of each of the Phi input objects.
>>>>>>>>           foreach o in phi.inputs
>>>>>>>>               # The function called below might re-use a lot of
>>>>>>>> the code/logic in `PhaseMacroExpand::scalar_replacement`
>>>>>>>>               producers += last_store_to_o_field(0, field)
>>>>>>>>           # Create a new phi node whose inputs are the Store's to
>>>>>>>> 'field'
>>>>>>>>           field_phi = create_new_phi(producers)
>>>>>>>>
>>>>>>>>           update_consumers(field, field_phi)
>>>>>>>>
>>>>>>>> The implementation that I envisioned would be as a "pre-process"
>>>>>>>> [3] step just
>>>>>>>> after EA but before the constraint checks in
>>>>>>>> `adjust_scalar_replaceable_state`
>>>>>>>> [2]. If we agree that the overall Scalar Replacement
>>>>>>>> implementation goes through
>>>>>>>> the following major phases:
>>>>>>>>
>>>>>>>>       1. Identify the Escape Status of objects.
>>>>>>>>       2. Adjust object Escape and/or Scalar Replacement status
>>>>>>>> based on a set of constraints.
>>>>>>>>       3. Make call to Split_unique_types [4].
>>>>>>>>       4 Iterate over object and array allocations.
>>>>>>>>           4.1 Check if allocation can be eliminated.
>>>>>>>>           4.2 Perform scalar replacement. Replace uses of object
>>>>>>>> in Safepoints.
>>>>>>>>           4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>>> AddP, ArrayCopy and CastP2X.
>>>>>>>>
>>>>>>>> The transformation that I am proposing would change the overall
>>>>>>>> flow to look
>>>>>>>> like this:
>>>>>>>>
>>>>>>>>       1. Identify the Escape Status of objects.
>>>>>>>>       2. ----> New: "Split phi functions" <----
>>>>>>>>       2. Adjust object Escape and/or Scalar Replacement status
>>>>>>>> based on a set of constraints.
>>>>>>>>       3. Make call to Split_unique_types [14].
>>>>>>>>       4 Iterate over object and array allocations.
>>>>>>>>           4.1 ----> Moved to split_phi: "Check if allocation can
>>>>>>>> be eliminated" <----
>>>>>>>>           4.2 Perform scalar replacement. Replace uses of object
>>>>>>>> in Safepoints.
>>>>>>>>           4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>>> AddP, ArrayCopy and CastP2X.
>>>>>>>>
>>>>>>>> Please let me know what you think and thank you for taking the
>>>>>>>> time to review
>>>>>>>> this!
>>>>>>>>
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Cesar
>>>>>>>>
>>>>>>>> Notes:
>>>>>>>>
>>>>>>>>       [1] I am not sure yet how this approach will play with the
>>>>>>>> case of a merge
>>>>>>>>           with NULL.
>>>>>>>>       [2]
>>>>>>>> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.com%2Fv3%2F__https%3A%2F%2Fnam06.safelinks.protection.outlook.com%2F%3Furl%3Dhttps*3A*2F*2Fgithub.com*2Fopenjdk*2Fjdk*2Fblob*2F2f71a6b39ed6bb869b4eb3e81bc1d87f4b3328ff*2Fsrc*2Fhotspot*2Fshare*2Fopto*2Fescape.cpp*23L1809%26amp%3Bdata%3D04*7C01*7CDivino.Cesar*40microsoft.com*7Ce13b0a3de30e43d7ac4408d9f22b75c5*7C72f988bf86f141af91ab2d7cd011db47*7C1*7C0*7C637807090747649306*7CUnknown*7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0*3D*7C3000%26amp%3Bsdata%3DWruWqIgGVIzIFIG0gD0*2FEIQm40euLx6FZMexotJVrlE*3D%26amp%3Breserved%3D0__%3BJSUlJSUlJSUlJSUlJSUlJSUlJSUlJSUlJSUl!!ACWV5N9M2RV99hQ!aWncAwAPjDG2SCD4yFD_Rj6ecIB4iHjD2gepnv2TCHRO5RJznILvC8eoFUe4wgtAURqRWDc%24&data=04%7C01%7CDivino.Cesar%40microsoft.com%7Ca6d5d5772dde4200119608d9f567aa3c%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637810647855476172%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=2%2Bkgn5CKitCdFdye5WyCkTut1kdtzld4UwMsTL%2B2yS4%3D&reserved=0
>>>>>>>>
>>>>>>>>
>>>>>>>>       [3] Another option would be to "patch" the current
>>>>>>>> implementation to be able
>>>>>>>>           to handle the merges. I am not certain that the "patch"
>>>>>>>> approach would be
>>>>>>>>           better, however, the "pre-process" approach is certainly
>>>>>>>> much easier to test
>>>>>>>>           and more readable.
>>>>>>>>
>>>>>>>>       [4] I cannot say I understand 100% the effects of executing
>>>>>>>>           split_unique_types(). Would the transformation that I am
>>>>>>>> proposing need to
>>>>>>>>           be after the call to split_unique_types?