RFC : Approach to handle Allocation Merges in C2 Scalar Replacement

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Feb 11 18:24:42 UTC 2022


> For this simple case we can teach C2's IGVN to split field loads 
> through Phi so that phi(p0, p1) is no longer used, and the allocations 
> become unused as well. We can do that because we know that the 
> allocations and the phi do not escape.

Seems like the split_unique_types() pass is a prerequisite to make such 
transformations predictable on a larger scale. It simplifies the memory 
graph and untangles instance-related memory operations from the rest of 
the graph.

Otherwise, I believe wide memory states/merges would pose serious 
problems. E.g., when a load is split through a phi, you need to pick 
the correct memory states for the new loads above the phi, which 
quickly becomes quite a challenging task.

Best regards,
Vladimir Ivanov

> 
> public int ex1(boolean cond, int first, int second) {
>      p0 = Allocate(...);
>      ...
>      p0.x = first;
>      p0.y = second;
> 
>      if (cond) {
>          p1 = Allocate(...);
>          ...
>          p1.x = second;
>          p1.y = first;
>      }
> 
>      p = phi(p0, p1) // unused and will be removed
> 
> 
>      return phi(p0.x,p1.x) - phi(p0.y, p1.y);
> }
> 
> Thanks,
> Vladimir K
> 
> On 2/10/22 11:29 AM, Vladimir Ivanov wrote:
>> (BCCing hotspot-dev and moving the discussion to hotspot-compiler-dev.)
>>
>> Hi Cesar,
>>
>> Thanks for looking into enhancing EA.
>>
>> Overall, the proposal looks reasonable.
>>
>> I suggest looking more closely at split_unique_types().
>> It introduces a dedicated class of alias categories for fields of the 
>> allocation being eliminated and clones the memory graph. I don't see 
>> why it shouldn't work for multiple allocations.
>>
>> However, split_unique_types() will break if you start optimizing 
>> multiple allocations at once. The notion of a unique alias should be 
>> adjusted to cover the union of unique aliases for all interacting 
>> allocations.
>>
>> Seems like you need to enhance SR to work on non-intersecting clusters 
>> of allocations.
>>
>> One thing to take care of: scalar replacement relies on 
>> TypeOopPtr::instance_id().
>>
>>    // If not InstanceTop or InstanceBot, indicates that this is
>>    // a particular instance of this type which is distinct.
>>    // This is the node index of the allocation node creating this 
>> instance.
>>    int           _instance_id;
>>
>> It'll break when multiple allocations are in play.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> On 09.02.2022 04:45, Cesar Soares Lucas wrote:
>>> Hi there again!
>>>
>>> Can you please give me feedback on the following approach to at least 
>>> partially
>>> address [1], the scalar replacement allocation merge issue?
>>>
>>> The problem that I am trying to solve arises when allocations are 
>>> merged after a
>>> control flow split. The code below shows _one example_ of such a 
>>> situation.
>>>
>>> public int ex1(boolean cond, int x, int y) {
>>>      Point p = new Point(x, y);
>>>      if (cond)
>>>          p = new Point(y, x);
>>>      // Allocations for p are merged here.
>>>      return p.calc();
>>> }
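>>>
>>> Point and calc() are not defined in the snippet above; a minimal, 
>>> runnable version of the example (assuming calc() is simply x - y, 
>>> which matches the inlined IR shown below, p.x - p.y) would be:

```java
// Hypothetical definitions for the example; calc() is assumed to be
// x - y, matching the inlined IR (p.x - p.y).
class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }
    int calc() { return x - y; }
}

class Ex1 {
    public static int ex1(boolean cond, int x, int y) {
        Point p = new Point(x, y);
        if (cond)
            p = new Point(y, x);
        // Allocations for p are merged here.
        return p.calc();
    }
}
```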
>>>
>>> Assuming the method calls on "p" are inlined, the allocations will 
>>> not escape the method. The C2 IR for this method will look like this:
>>>
>>> public int ex1(boolean cond, int first, int second) {
>>>      p0 = Allocate(...);
>>>      ...
>>>      p0.x = first;
>>>      p0.y = second;
>>>
>>>      if (cond) {
>>>          p1 = Allocate(...);
>>>          ...
>>>          p1.x = second;
>>>          p1.y = first;
>>>      }
>>>
>>>      p = phi(p0, p1)
>>>
>>>      return p.x - p.y;
>>> }
>>>
>>> However, one of the constraints implemented here [2], specifically 
>>> the third
>>> one, will prevent the objects from being scalar replaced.
>>>
>>> The approach that I'm considering for solving the problem is to 
>>> replace the Phi
>>> node `p = phi(p0, p1)` with new Phi nodes for each of the fields of 
>>> the objects
>>> in the original Phi. The IR for `ex1` would look something like this 
>>> after the
>>> transformation:
>>>
>>> public int ex1(boolean cond, int first, int second) {
>>>      p0 = Allocate(...);
>>>      ...
>>>      p0.x = first;
>>>      p0.y = second;
>>>
>>>      if (cond) {
>>>          p1 = Allocate(...);
>>>          ...
>>>          p1.x = second;
>>>          p1.y = first;
>>>      }
>>>
>>>      pX = phi(first, second)
>>>      pY = phi(second, first)
>>>
>>>      return pX - pY;
>>> }
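>>>
>>> In source-level terms, each field phi acts as a select between the 
>>> values stored on the two paths, so no allocation remains. A runnable 
>>> Java sketch of the transformed code (the class name is illustrative):

```java
class Ex1Split {
    // Source-level equivalent of the IR after splitting the object phi
    // into per-field phis; no Point allocation is left.
    public static int ex1(boolean cond, int first, int second) {
        int pX = cond ? second : first;  // pX = phi(first, second)
        int pY = cond ? first : second;  // pY = phi(second, first)
        return pX - pY;
    }
}
```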
>>>
>>> I understand that this transformation might not be applicable for all 
>>> cases and
>>> that it's not as simple as illustrated above. Also, it seems to me 
>>> that much of
>>> what I'd have to implement is already implemented in other steps of 
>>> the Scalar
>>> Replacement pipeline (which is a good thing). To work around these
>>> implementation details I plan to use as much of the existing code as 
>>> possible.
>>> The algorithm for the transformation would be like this:
>>>
>>> split_phis(phi)
>>>      # If the output of phi escapes, or something uses its identity, etc.,
>>>      # then we can't remove it. The conditions here might possibly be
>>>      # the same as the ones implemented in
>>>      # `PhaseMacroExpand::can_eliminate_allocation`.
>>>      if cant_remove_phi_output(phi)
>>>          return
>>>
>>>      # Collect a set of tuples (F, U) containing nodes U that use
>>>      # field F of the object resulting from `phi`.
>>>      fields_used = collect_fields_used_after_phi(phi)
>>>
>>>      foreach field in fields_used
>>>          producers = {}
>>>
>>>          # Create a list with the last Store to "field" in the
>>>          # scope of each of the Phi's input objects.
>>>          foreach o in phi.inputs
>>>              # The function called below might reuse a lot of the
>>>              # code/logic in `PhaseMacroExpand::scalar_replacement`.
>>>              producers += last_store_to_o_field(o, field)
>>>          # Create a new phi node whose inputs are the Stores to 'field'.
>>>          field_phi = create_new_phi(producers)
>>>
>>>          update_consumers(field, field_phi)
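>>>
>>> To make the data flow of split_phis concrete, here is a toy 
>>> simulation in plain Java: each phi input object is modeled as a map 
>>> from field name to the value last stored to it, and each per-field 
>>> phi as the list of per-input producers. All names are illustrative; 
>>> none of this is a C2 API.

```java
import java.util.*;

// Toy simulation of the split_phis sketch above.
class SplitPhisSketch {
    // Stand-in for last_store_to_o_field(o, field): look up the value
    // last stored to `field` in object `o`.
    static int lastStore(Map<String, Integer> o, String field) {
        return o.get(field);
    }

    // For each field used after the merge, collect one producer per
    // phi input and record the resulting per-field "phi".
    static Map<String, List<Integer>> splitPhis(
            List<Map<String, Integer>> phiInputs, Set<String> fieldsUsed) {
        Map<String, List<Integer>> fieldPhis = new HashMap<>();
        for (String field : fieldsUsed) {
            List<Integer> producers = new ArrayList<>();
            for (Map<String, Integer> o : phiInputs) {
                producers.add(lastStore(o, field));
            }
            fieldPhis.put(field, producers); // create_new_phi(producers)
        }
        return fieldPhis;
    }
}
```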
>>>
>>> The implementation that I envisioned would be a "pre-process" [3] 
>>> step just
>>> after EA but before the constraint checks in 
>>> `adjust_scalar_replaceable_state`
>>> [2]. If we agree that the overall Scalar Replacement implementation 
>>> goes through
>>> the following major phases:
>>>
>>>      1. Identify the Escape Status of objects.
>>>      2. Adjust object Escape and/or Scalar Replacement status based 
>>> on a set of constraints.
>>>      3. Make call to split_unique_types [4].
>>>      4. Iterate over object and array allocations.
>>>          4.1 Check if allocation can be eliminated.
>>>          4.2 Perform scalar replacement. Replace uses of object in 
>>> Safepoints.
>>>          4.3 Process users of CheckCastPP other than Safepoint: AddP, 
>>> ArrayCopy and CastP2X.
>>>
>>> The transformation that I am proposing would change the overall flow 
>>> to look
>>> like this:
>>>
>>>      1. Identify the Escape Status of objects.
>>>      2. ----> New: "Split phi functions" <----
>>>      3. Adjust object Escape and/or Scalar Replacement status based 
>>> on a set of constraints.
>>>      4. Make call to split_unique_types [4].
>>>      5. Iterate over object and array allocations.
>>>          5.1 ----> Moved to split_phis: "Check if allocation can be 
>>> eliminated" <----
>>>          5.2 Perform scalar replacement. Replace uses of object in 
>>> Safepoints.
>>>          5.3 Process users of CheckCastPP other than Safepoint: AddP, 
>>> ArrayCopy and CastP2X.
>>>
>>> Please let me know what you think and thank you for taking the time 
>>> to review
>>> this!
>>>
>>>
>>> Regards,
>>> Cesar
>>>
>>> Notes:
>>>
>>>      [1] I am not sure yet how this approach will play with the case 
>>> of a merge
>>>          with NULL.
>>>      [2] 
>>> https://github.com/openjdk/jdk/blob/2f71a6b39ed6bb869b4eb3e81bc1d87f4b3328ff/src/hotspot/share/opto/escape.cpp#L1809 
>>>
>>>
>>>      [3] Another option would be to "patch" the current 
>>> implementation to be able
>>>          to handle the merges. I am not certain that the "patch" 
>>> approach would be
>>>          better; however, the "pre-process" approach is certainly 
>>> much easier to test
>>>          and more readable.
>>>
>>>      [4] I cannot say I understand 100% the effects of executing
>>>          split_unique_types(). Would the transformation that I am 
>>> proposing need to
>>>          be after the call to split_unique_types?

