RFR: 8304720: SuperWord::schedule should rebuild C2-graph from SuperWord dependency-graph

Thu May 4 10:25:19 UTC 2023

On Wed, 5 Apr 2023 14:55:55 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

> `SuperWord:schedule`, and specifically `SuperWord::co_locate_pack` is broken.
> The problem is with the basic approach of it, as far as I know.
> Hence, I had to completely re-design the `schedule` algorithm, based on the `PacksetGraph` ([JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042), https://git.openjdk.org/jdk/pull/13078).
> 
> **The current approach**
> 
> The idea is to leave the non-vectorized memory ops in their place, and find the right place for the vectorized memops to be "sandwiched" into. The logic is very complex and has already had a few bugs fixed.
> 
> **Why this does not work**
> 
> However, in some rare cases, we have to reorder non-vectorized operations. See this example that I added as a regression test:
> 
> https://github.com/openjdk/jdk/blob/a771a61005aea272cc51fa3f3e1637c217582fce/test/hotspot/jtreg/compiler/loopopts/superword/TestScheduleReordersScalarMemops.java#L82-L109
> 
> I found this issue during work on https://github.com/openjdk/jdk/pull/13078, where I had to restrict/disable some tests that are now passing.
> 
> **Solution**
> 
> Abandon the idea of "sandwiching" memops. Rewrite `SuperWord:schedule`:
> 
> https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2567-L2576
> 
> We first schedule all memops into a linear order.
> We do this scheduling based on the `PacksetGraph`, which gives us a `DAG` based on the `packset` and the dependency-graph (which in turn respects the data use-defs, as well as the memory dependencies, unless we can prove that they do not reference the same memory). 
> In other words: we have a linearization that respects all dependencies that must be respected.
> Further, we make sure that ops from the same pack are scheduled as a block (all adjacent to each other), and in order that the packset has internally.
> 
> https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2489-L2493
> 
> Now that we have this order (and we have not aborted because we found a cycle in the `PacksetGraph`), we must apply this schedule to each memory slice, and reorder the memops in the slices accordingly.
> 
> https://github.com/openjdk/jdk/blob/6bb2da3da988618803823e905f23cb106cd9d6b2/src/hotspot/share/opto/superword.cpp#L2617-L2619
> 
> This scheduling has the nice side-effect of simplifying `SuperWord::output` a little.
> We know now that the first element in a pack is also first in the slice order, and the last element in the pack is last in the slice (because we schedule the packs as a block, i.e. in the pack order).
> 
> **Discussion**
> 
> This seems to me to be a much more straight forward approach, and it uses the code I recently added for verification of cyclic dependencies in the packset ([JDK-8304042](https://bugs.openjdk.org/browse/JDK-8304042), https://git.openjdk.org/jdk/pull/13078).
> 
> One potential improvement to my fix:
> We now sometimes re-order the non-vectorized memory slices, even though it may not be necessary.
> This is not wrong, but it makes updates to the graph that may be confusing when debugging.
> Further, the re-ordering may have performance impacts.
> I could use a priority-queue (min-heap, would have to implement it since it does not yet exist), and schedule the `PacksetGraph` whenever possible with the lower `bb_idx` first. This would make the new linear order the same/closer to the old one. However, I am not sure if this is worth the effort and overhead of a priority-queue.
> 
> **Testing**
> Github-actions pass. tier1-6 + stress testing passes.
> Performance testing showed no significant performance change.

Great work!

src/hotspot/share/opto/superword.cpp line 2766:

> 2764: 
> 2765:   for (int i = 0; i < _block.length(); i++) {
> 2766:     Node* n = _block.at(i); // last in pack

Nit: how about moving this comment line to the upward side of the if clause on L2768? A little bit confusing now.

src/hotspot/share/opto/superword.cpp line 2768:

> 2766:     Node* n = _block.at(i); // last in pack
> 2767:     Node_List* p = my_pack(n);
> 2768:     if (p != nullptr && n == p->at(p->size()-1)) {

Sorry, I don't quite understand why the mem ops in the pack are internally in order. Maybe I missed somewhere you reordered these ops in the same pack using linearized memops_schedule. Could you please point it out for me? Thanks.

test/hotspot/jtreg/compiler/loopopts/superword/TestScheduleReordersScalarMemops.java line 32:

> 30:  *          be reordered during SuperWord::schedule.
> 31:  * @requires vm.compiler2.enabled
> 32:  * @requires vm.cpu.features ~= ".*avx2.*" | vm.cpu.features ~= ".*asimd.*"

Can we drop this line just like you did in the files above? Seems we have cpu feature check separately for each test.

-------------

PR Review: https://git.openjdk.org/jdk/pull/13354#pullrequestreview-1412534697
PR Review Comment: https://git.openjdk.org/jdk/pull/13354#discussion_r1184824754
PR Review Comment: https://git.openjdk.org/jdk/pull/13354#discussion_r1184819886
PR Review Comment: https://git.openjdk.org/jdk/pull/13354#discussion_r1184680686