RFR: 8333393: PhaseCFG::insert_anti_dependences can fail to raise LCAs and to add necessary anti-dependence edges [v10]
Quan Anh Mai
qamai at openjdk.org
Fri Jan 24 18:21:50 UTC 2025
On Fri, 24 Jan 2025 18:15:34 GMT, Daniel Lundén <dlunden at openjdk.org> wrote:
>> When searching for load anti-dependences in GCM, it is not always sufficient to search only from the load's direct initial memory input. Specifically, there are cases where we must also search for anti-dependences starting at relevant memory Phi nodes in between the load's early block and the initial memory input's block. Here, "in between" refers to blocks that lie between the early block and the initial memory block in the dominator tree.
>>
>> #### Example 1
>>
>> Consider the ideal graph below. The initial memory for 183 loadI is 107 Phi, and there is an important anti-dependence on node 64 membar_release. To discover this anti-dependence, we must instead search from 119 Phi, whose memory slices overlap with those of 107 Phi. Looking at the ideal graph block view, we see that both 107 Phi and 119 Phi are in the initial memory block (B7) and thus dominate the early block (B20). If we only search from 107 Phi, we fail to add the anti-dependence to 64 membar_release and do not force the load to be scheduled before 64 membar_release as we should. In the block view, we see that the load is actually scheduled in B24 _after_ a number of anti-dependent stores, the first of which is in block B20 (corresponding to the anti-dependence on 64 membar_release). The result is the failure we see in this issue (we load the wrong value).
>>
>> [ideal graph and block view screenshots omitted]
>>
>> #### Example 2
>>
>> There are also situations when we need to start searching from Phis that are strictly in between the initial memory block and early block. Consider the ideal graph below. The initial memory for 100 loadI is 18 MachProj, but we also need to search from 76 Phi to find that we must raise the LCA to the last block on the path between 76 Phi and 75 Phi: B9 (= the load's early block). If we do not search from 76 Phi, the load is again likely scheduled too late (in B11 in the example) after anti-dependent stores (the first of which corresponds to 58 membar_release in B10). Note that the block B6 for 76 Phi is strictly dominated by the initial memory block B2 and also strictly dominates the early block B9.
>>
>> [ideal graph and block view screenshots omitted]
>>
>> ### Cha...
>
> Daniel Lundén has updated the pull request incrementally with one additional commit since the last revision:
>
> Rewording of semantics
I expect the explanation to start with the relation between the graph topology and the constraints it imposes on scheduling, and then show why the implementation enforces those constraints. Based on my understanding of the issue, which I am not sure is correct, I propose the following explanation:
The relation between the memory graph topology and the constraints it imposes on scheduling:
* At a given point in the control flow, a memory node `n` can be in one of 3 states: undefined, live, or killed. The state at each point satisfies the following rules (sketched in code right after this list):
- Within a basic block `b`, where `i` ranges over consecutive scheduled positions:
+ If `n` is undefined at `i`, then `n` is undefined at `i + 1` unless the node at `i` is `n`, in which case `n` is live at `i + 1`.
+ If `n` is live at `i`, then `n` is live at `i + 1` unless the node at `i` is a memory kill that can kill `n`, in which case `n` is killed at `i + 1`.
+ If `n` is killed at `i`, then `n` is killed at `i + 1`.
- At the beginning of a basic block `b`:
+ If `n` is undefined at the end of any of the predecessors of `b`, then `n` is undefined at the beginning of `b`.
+ Otherwise, if `n` is killed at the end of any of the predecessors of `b`, then `n` is killed at the beginning of `b`.
+ Otherwise, `n` is live at the beginning of `b`.
* The rules for when a memory kill `k` can kill a memory node `n`:
- There are 2 kinds of kills, 1-kills and 2-kills (think of 1-kill as killing the whole and 2-kill as killing a part).
- If memory kill `k` 1-kills a memory node `n`, then `k` 2-kills `n`.
- If a memory node `n` is the memory input of a memory kill `k`, then `k` 1-kills `n`.
- If a memory kill `k` 1-kills a memory merge (`MergeMem`, `Phi`), then `k` 1-kills all of the merge inputs.
- If a memory kill `k` 2-kills an input of a memory merge, then `k` 2-kills the merge.
- All memory nodes that are not `MergeMem` or `Phi` are memory kills.
* A load that has a memory input `n` can only be scheduled at the locations at which `n` is live.
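To make the state rules above concrete, here is a minimal self-contained sketch (made-up code, not anything in HotSpot) of the within-block transfer and the block-entry meet; `MemState`, `transfer`, and `meet` are names invented for illustration:

```c++
#include <cassert>
#include <vector>

enum class MemState { Undefined, Live, Killed };

// Transfer within a block: the state of n after position i, given its
// state before i. node_is_n / node_kills_n abstract over the node
// scheduled at position i.
MemState transfer(MemState before, bool node_is_n, bool node_kills_n) {
  switch (before) {
    case MemState::Undefined: // n becomes live exactly where it is defined
      return node_is_n ? MemState::Live : MemState::Undefined;
    case MemState::Live:      // a kill that can kill n ends its live range
      return node_kills_n ? MemState::Killed : MemState::Live;
    case MemState::Killed:    // once killed, n stays killed in this block
      return MemState::Killed;
  }
  return MemState::Undefined; // unreachable
}

// Meet at a block entry: Undefined wins over Killed, which wins over Live.
MemState meet(const std::vector<MemState>& pred_exits) {
  bool any_killed = false;
  for (MemState s : pred_exits) {
    if (s == MemState::Undefined) return MemState::Undefined;
    any_killed |= (s == MemState::Killed);
  }
  return any_killed ? MemState::Killed : MemState::Live;
}

int main() {
  // A load with memory input n is only schedulable where n is Live.
  MemState s = MemState::Undefined;
  s = transfer(s, /*node_is_n=*/true,  /*node_kills_n=*/false); // def of n
  assert(s == MemState::Live);
  s = transfer(s, /*node_is_n=*/false, /*node_kills_n=*/true);  // a kill
  assert(s == MemState::Killed);
  assert(meet({MemState::Live, MemState::Killed}) == MemState::Killed);
}
```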
As a result, to find all kills `k` that can kill a memory node `n`, we need to go up the memory graph through all merges as well as go down the memory graph through all merges (note that we only go up during the up part and only go down during the down part, which means we never go horizontally from one `Phi` input to another `Phi` input). We then collect all memory kills that have one of the visited nodes as their memory input, and discard all kills at positions where `n` is undefined (maybe using dominator information); the remaining nodes are the anti-dependent stores.
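For what it is worth, here is a self-contained sketch of that two-phase search, again with made-up types rather than C2's node classes. The up phase is the reverse of the "2-kill of a merge input 2-kills the merge" rule, and the down phase is the reverse of the "1-kill of a merge 1-kills all its inputs" rule; the dominator-based discarding of positions where `n` is undefined is omitted:

```c++
#include <unordered_set>
#include <vector>

// Made-up stand-in for a memory node in the ideal graph.
struct MemNode {
  bool is_merge = false;            // MergeMem or memory Phi
  std::vector<MemNode*> inputs;     // memory defs (the up direction)
  std::vector<MemNode*> merge_uses; // merges using this node (the down direction)
  std::vector<MemNode*> kill_uses;  // memory kills whose memory input is this node
};

// Collect every memory kill that can 1-kill or 2-kill n under the rules
// above (the undefined-position filtering is left out).
std::vector<MemNode*> kills_of(MemNode* n) {
  std::unordered_set<MemNode*> visited;

  // Up phase: from n, walk through merge inputs only. The visited set
  // also keeps loop Phis (which form cycles) from recursing forever.
  std::vector<MemNode*> work{n};
  while (!work.empty()) {
    MemNode* m = work.back(); work.pop_back();
    if (!visited.insert(m).second) continue;
    if (m->is_merge) {
      for (MemNode* in : m->inputs) work.push_back(in);
    }
  }

  // Down phase: from every node found so far, walk through merge uses only;
  // we never turn around and walk a downstream merge's other inputs.
  work.assign(visited.begin(), visited.end());
  while (!work.empty()) {
    MemNode* m = work.back(); work.pop_back();
    for (MemNode* use : m->merge_uses) {
      if (visited.insert(use).second) work.push_back(use);
    }
  }

  // Any memory kill whose memory input is a visited node can kill n.
  std::vector<MemNode*> kills;
  for (MemNode* m : visited) {
    for (MemNode* k : m->kill_uses) kills.push_back(k);
  }
  return kills;
}

int main() {
  // n feeds a merge; a kill whose memory input is the merge 2-kills n.
  MemNode n, merge, kill;
  merge.is_merge = true;
  merge.inputs = {&n};
  n.merge_uses = {&merge};
  merge.kill_uses = {&kill};
  return kills_of(&n).size() == 1 ? 0 : 1;
}
```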
I am not sure how to go from this to our implementation. Consider all the cases:
- Going down through a `MergeMem`.
- Going down through a `Phi`: this `Phi` must be dominated by `n`, or else `n` is undefined there and we can skip it. Since the `Phi` must be dominated by `n`, it can either (see the sketch after this list):
+ Lie between the block of `n` and `early`, which we cover in our sweep.
+ Lie after `early`, in which case we just schedule the load above the `Phi` and skip any anti-dependence search, since the anti-dependences are satisfied by this placement.
- Going up through a `Phi`: Do we cover this?
- Going up through a `MergeMem`: I assume there should be no kills of the inputs of a `MergeMem`, so we do not need to care about this?
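To illustrate the case split for going down through a `Phi`, here is a rough sketch using a made-up immediate-dominator chain in place of the real dominator information; it only mirrors the outcomes listed above, not the actual implementation:

```c++
// Made-up CFG block with an immediate-dominator link.
struct Block {
  const Block* idom = nullptr; // immediate dominator (nullptr for the root)
};

// Does a dominate b? A block dominates itself.
bool dom(const Block* a, const Block* b) {
  for (const Block* c = b; c != nullptr; c = c->idom) {
    if (c == a) return true;
  }
  return false;
}

enum class PhiCase {
  SkipUndefined,  // n is undefined at the Phi; nothing found there kills n
  CoveredBySweep, // the Phi's block lies on the dominator path we sweep
  RaiseAbovePhi   // schedule the load above the Phi instead of searching on
};

// Classify a Phi reached while going down, given the block defining n
// and the load's early block.
PhiCase classify(const Block* n_block, const Block* early_block,
                 const Block* phi_block) {
  if (!dom(n_block, phi_block)) return PhiCase::SkipUndefined;
  if (dom(phi_block, early_block)) return PhiCase::CoveredBySweep;
  return PhiCase::RaiseAbovePhi; // the Phi lies after early
}

int main() {
  Block root, early{&root}, phi{&early};
  // A Phi strictly below the early block: schedule the load above it.
  return classify(&root, &early, &phi) == PhiCase::RaiseAbovePhi ? 0 : 1;
}
```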
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22852#issuecomment-2613130457