RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Feb 15 16:14:49 UTC 2022
On 2/14/22 11:32 AM, Cesar Soares Lucas wrote:
> Here is my understanding from the discussion so far, please correct me if I'm
> wrong: Improving split_unique_types to make it able to handle "clusters" of
> allocations will not only make it easier to implement the "split through phi"
> transformation but also make it applicable to more complex merge situations.
> Did I get it right?
Yes. Please, see the last response from Vladimir I.
Thanks,
Vladimir K
>
>
> Regards,
> Cesar
> ________________________________________
> From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
> Sent: February 11, 2022 2:22 PM
> To: Vladimir Kozlov; Cesar Soares Lucas; hotspot compiler
> Cc: Brian Stafford; Martijn Verburg
> Subject: Re: RFC : Approach to handle Allocation Merges in C2 Scalar Replacement
>
>
>>>> For this simple case we can teach C2's IGVN to split field loads
>>>> through Phi so that phi(p0, p1) is no longer used, and neither are the
>>>> allocations. We can do that because we know that the allocations and
>>>> the phi do not escape.
>>>
>>> Seems like split_unique_types() pass is a prerequisite to make such
>>> transformations predictable on a larger scale. It simplifies the
>>> memory graph and untangles instance-related memory operations from the
>>> rest of the graph.
>>
>> Can you elaborate on your suggestion?
>>
>> As you correctly pointed out in your original post, EA and SR heavily
>> depend on _instance_id, which is the Allocation node's _idx value. And
>> the solution for this is either to remove the reference to the merge
>> phi, or to introduce a new "virtual allocation" whose fields are
>> populated by loads - but that again needs splitting through the phi.
>
> My rough proposal is to group interfering non-escaping allocations into
> "clusters" and introduce separate memory slices for such clusters (as
> part of split_unique_types()). The clusters must not intersect;
> otherwise, the slice won't be unique anymore.
>
> Once the memory graph for a unique alias is separated, it should be much
> simpler to optimize it further. If all memory operations are folded, the
> allocation can go away.
>
> Splitting memory operations through phis should help untangle the
> allocations. So, even if one allocation can't be fully scalarized, it
> shouldn't keep other allocations from being scalarized (unless they alias).
>
> I don't have a good understanding of how the new logic will interact
> with unique instance slices & _instance_id. Maybe a gradual transition
> from the original graph to clustered slices to unique instance slices
> would work.
>
> Best regards,
> Vladimir Ivanov
>
>>>
>>> Otherwise, I believe wide memory states/merges would pose serious
>>> problems. E.g., when a load is split through a phi, you need to pick
>>> the correct memory states for the new loads above the phi, which
>>> quickly becomes quite a challenging task.
>>
>> Especially in loops. Actually, I did try such an approach before and I
>> agree that it is hard.
>>
>> Thanks,
>> Vladimir K
>>
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>
>>>> public int ex1(boolean cond, int first, int second) {
>>>>     p0 = Allocate(...);
>>>>     ...
>>>>     p0.x = first;
>>>>     p0.y = second;
>>>>
>>>>     if (cond) {
>>>>         p1 = Allocate(...);
>>>>         ...
>>>>         p1.x = second;
>>>>         p1.y = first;
>>>>     }
>>>>
>>>>     p = phi(p0, p1) // unused and will be removed
>>>>
>>>>     return phi(p0.x, p1.x) - phi(p0.y, p1.y);
>>>> }
>>>>
>>>> Thanks,
>>>> Vladimir K
>>>>
>>>> On 2/10/22 11:29 AM, Vladimir Ivanov wrote:
>>>>> (BCCing hotspot-dev and moving the discussion to hotspot-compiler-dev.)
>>>>>
>>>>> Hi Cesar,
>>>>>
>>>>> Thanks for looking into enhancing EA.
>>>>>
>>>>> Overall, the proposal looks reasonable.
>>>>>
>>>>> I suggest looking more closely at split_unique_types().
>>>>> It introduces a dedicated class of alias categories for the fields of
>>>>> the allocation being eliminated and clones the memory graph. I don't
>>>>> see why it shouldn't work for multiple allocations.
>>>>>
>>>>> Moreover, split_unique_types() as it is now will break if you start
>>>>> optimizing multiple allocations at once. The notion of a unique alias
>>>>> should be adjusted to cover the union of unique aliases for all
>>>>> interacting allocations.
>>>>>
>>>>> Seems like you need to enhance SR to work on non-intersecting
>>>>> clusters of allocations.
>>>>>
>>>>> One thing to take care of: scalar replacement relies on
>>>>> TypeOopPtr::instance_id().
>>>>>
>>>>>     // If not InstanceTop or InstanceBot, indicates that this is
>>>>>     // a particular instance of this type which is distinct.
>>>>>     // This is the node index of the allocation node creating this instance.
>>>>>     int _instance_id;
>>>>>
>>>>> It'll break when multiple allocations are in play.
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> On 09.02.2022 04:45, Cesar Soares Lucas wrote:
>>>>>> Hi there again!
>>>>>>
>>>>>> Can you please give me feedback on the following approach to at least
>>>>>> partially address [1], the scalar replacement allocation merge issue?
>>>>>>
>>>>>> The problem that I am trying to solve arises when allocations are
>>>>>> merged after a control flow split. The code below shows _one example_
>>>>>> of such a situation.
>>>>>>
>>>>>> public int ex1(boolean cond, int x, int y) {
>>>>>>     Point p = new Point(x, y);
>>>>>>     if (cond)
>>>>>>         p = new Point(y, x);
>>>>>>     // Allocations for p are merged here.
>>>>>>     return p.calc();
>>>>>> }
>>>>>>
>>>>>> Assuming the method calls on "p" are inlined, the allocations will
>>>>>> not escape the method. The C2 IR for this method will look like this:
>>>>>>
>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>     p0 = Allocate(...);
>>>>>>     ...
>>>>>>     p0.x = first;
>>>>>>     p0.y = second;
>>>>>>
>>>>>>     if (cond) {
>>>>>>         p1 = Allocate(...);
>>>>>>         ...
>>>>>>         p1.x = second;
>>>>>>         p1.y = first;
>>>>>>     }
>>>>>>
>>>>>>     p = phi(p0, p1)
>>>>>>
>>>>>>     return p.x - p.y;
>>>>>> }
>>>>>>
>>>>>> However, one of the constraints implemented here [2], specifically
>>>>>> the third
>>>>>> one, will prevent the objects from being scalar replaced.
>>>>>>
>>>>>> The approach that I'm considering for solving the problem is to
>>>>>> replace the Phi
>>>>>> node `p = phi(p0, p1)` with new Phi nodes for each of the fields of
>>>>>> the objects
>>>>>> in the original Phi. The IR for `ex1` would look something like
>>>>>> this after the
>>>>>> transformation:
>>>>>>
>>>>>> public int ex1(boolean cond, int first, int second) {
>>>>>>     p0 = Allocate(...);
>>>>>>     ...
>>>>>>     p0.x = first;
>>>>>>     p0.y = second;
>>>>>>
>>>>>>     if (cond) {
>>>>>>         p1 = Allocate(...);
>>>>>>         ...
>>>>>>         p1.x = second;
>>>>>>         p1.y = first;
>>>>>>     }
>>>>>>
>>>>>>     pX = phi(first, second)
>>>>>>     pY = phi(second, first)
>>>>>>
>>>>>>     return pX - pY;
>>>>>> }
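As a sanity check of the proposed rewrite, the before and after versions of `ex1` can be modeled in plain Python (tuples stand in for the Point allocations; this is an illustration of the transformation's semantics, not C2 code) and compared on both branches:

```python
# Before/after model of the proposed transformation: the original merge
# of two Point allocations versus the field-wise phis (pX, pY).

def ex1_original(cond, x, y):
    p = (x, y)           # p0 = new Point(x, y)
    if cond:
        p = (y, x)       # p1 = new Point(y, x)
    return p[0] - p[1]   # p.x - p.y through the merge phi

def ex1_split(cond, first, second):
    # pX = phi(first, second); pY = phi(second, first)
    pX = second if cond else first
    pY = first if cond else second
    return pX - pY       # no object merge left; both phis carry ints

for cond in (False, True):
    assert ex1_original(cond, 7, 3) == ex1_split(cond, 7, 3)
print("equivalent on both branches")
```

The key property is that each field phi selects, per predecessor, the value last stored to that field in that predecessor's object.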
>>>>>>
>>>>>> I understand that this transformation might not be applicable for
>>>>>> all cases and
>>>>>> that it's not as simple as illustrated above. Also, it seems to me
>>>>>> that much of
>>>>>> what I'd have to implement is already implemented in other steps of
>>>>>> the Scalar
>>>>>> Replacement pipeline (which is a good thing). To work around these
>>>>>> implementation details I plan to use as much of the existing code
>>>>>> as possible.
>>>>>> The algorithm for the transformation would be like this:
>>>>>>
>>>>>> split_phis(phi)
>>>>>>     # If the output of phi escapes, or something uses its identity,
>>>>>>     # etc., then we can't remove it. The conditions here might possibly
>>>>>>     # be the same as the ones implemented in
>>>>>>     # `PhaseMacroExpand::can_eliminate_allocation`.
>>>>>>     if cant_remove_phi_output(phi)
>>>>>>         return
>>>>>>
>>>>>>     # Collect a set of tuples (F, U) containing nodes U that use
>>>>>>     # field F of the object resulting from `phi`.
>>>>>>     fields_used = collect_fields_used_after_phi(phi)
>>>>>>
>>>>>>     foreach field in fields_used
>>>>>>         producers = {}
>>>>>>
>>>>>>         # Create a list with the last Store to "field" in the scope
>>>>>>         # of each of the Phi input objects.
>>>>>>         foreach o in phi.inputs
>>>>>>             # The function called below might re-use a lot of the
>>>>>>             # code/logic in `PhaseMacroExpand::scalar_replacement`.
>>>>>>             producers += last_store_to_o_field(o, field)
>>>>>>
>>>>>>         # Create a new phi node whose inputs are the Stores to 'field'.
>>>>>>         field_phi = create_new_phi(producers)
>>>>>>
>>>>>>         update_consumers(field, field_phi)
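The core loop of that pseudocode can be exercised on a toy, dictionary-based "IR". Everything below is illustrative (the real pass works on C2's node graph, and `split_phis` here only builds the field phis; the escape check and consumer rewiring are omitted):

```python
# Toy model of the split_phis pseudocode: build one value-phi per field
# used after the object phi, taking the last Store per merged object.

def split_phis(phi_inputs, last_stores, field_uses):
    """phi_inputs: object names merged by the phi, in predecessor order.
    last_stores: {(obj, field): value} -- last Store per object/field.
    field_uses: fields loaded from the phi's result.
    Returns {field: list of phi inputs} for each used field."""
    field_phis = {}
    for field in field_uses:
        # One phi input per merged object: the value of its last store.
        producers = [last_stores[(o, field)] for o in phi_inputs]
        field_phis[field] = producers
    return field_phis

# ex1: p = phi(p0, p1); the uses after the merge are p.x and p.y.
stores = {("p0", "x"): "first", ("p0", "y"): "second",
          ("p1", "x"): "second", ("p1", "y"): "first"}
phis = split_phis(["p0", "p1"], stores, ["x", "y"])
print(phis)  # {'x': ['first', 'second'], 'y': ['second', 'first']}
```

This reproduces the `pX = phi(first, second)` / `pY = phi(second, first)` result from the transformed `ex1` above.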
>>>>>>
>>>>>> The implementation that I envisioned would be as a "pre-process"
>>>>>> [3] step just
>>>>>> after EA but before the constraint checks in
>>>>>> `adjust_scalar_replaceable_state`
>>>>>> [2]. If we agree that the overall Scalar Replacement implementation
>>>>>> goes through
>>>>>> the following major phases:
>>>>>>
>>>>>> 1. Identify the Escape Status of objects.
>>>>>> 2. Adjust object Escape and/or Scalar Replacement status based on a
>>>>>>    set of constraints.
>>>>>> 3. Make a call to split_unique_types() [4].
>>>>>> 4. Iterate over object and array allocations.
>>>>>>    4.1 Check if the allocation can be eliminated.
>>>>>>    4.2 Perform scalar replacement. Replace uses of the object in
>>>>>>        Safepoints.
>>>>>>    4.3 Process users of CheckCastPP other than Safepoint:
>>>>>>        AddP, ArrayCopy and CastP2X.
>>>>>>
>>>>>> The transformation that I am proposing would change the overall
>>>>>> flow to look
>>>>>> like this:
>>>>>>
>>>>>> 1. Identify the Escape Status of objects.
>>>>>> 2. ----> New: "Split phi functions" <----
>>>>>> 3. Adjust object Escape and/or Scalar Replacement status based on a
>>>>>>    set of constraints.
>>>>>> 4. Make a call to split_unique_types() [4].
>>>>>> 5. Iterate over object and array allocations.
>>>>>>    5.1 ----> Moved to split_phis: "Check if the allocation can be
>>>>>>         eliminated" <----
>>>>>>    5.2 Perform scalar replacement. Replace uses of the object in
>>>>>>        Safepoints.
>>>>>>    5.3 Process users of CheckCastPP other than Safepoint:
>>>>>>        AddP, ArrayCopy and CastP2X.
>>>>>>
>>>>>> Please let me know what you think and thank you for taking the time
>>>>>> to review
>>>>>> this!
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Cesar
>>>>>>
>>>>>> Notes:
>>>>>>
>>>>>> [1] I am not sure yet how this approach will play with the case of a
>>>>>>     merge with NULL.
>>>>>> [2] https://github.com/openjdk/jdk/blob/2f71a6b39ed6bb869b4eb3e81bc1d87f4b3328ff/src/hotspot/share/opto/escape.cpp#L1809
>>>>>>
>>>>>>
>>>>>> [3] Another option would be to "patch" the current implementation to
>>>>>>     be able to handle the merges. I am not certain that the "patch"
>>>>>>     approach would be better; however, the "pre-process" approach is
>>>>>>     certainly much easier to test and more readable.
>>>>>>
>>>>>> [4] I cannot say I understand 100% the effects of executing
>>>>>>     split_unique_types(). Would the transformation that I am proposing
>>>>>>     need to be after the call to split_unique_types()?