RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Tue Jun 27 17:52:15 UTC 2023

On Wed, 21 Jun 2023 08:24:19 GMT, Pengfei Li <pli at openjdk.org> wrote:

> ## TL;DR
> 
> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, this new implementation adds a standalone loop phase in C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details about this patch can be found in the document replied in this pull request.

General question: do you have any tests that vary the loop limit and check that the loop stops at exactly the right iteration? This would be even more interesting with mixed-type examples, to confirm that the vectors are neither over- nor under-duplicated.
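A minimal sketch of the kind of test I have in mind. The class name, the kernel, and the sentinel scheme are all hypothetical and not taken from the patch. It runs a mixed-type loop (byte and long) for every limit from 0 up to a maximum, and uses a sentinel value to detect both missing writes below the limit and stray writes at or above it.

```java
public class VaryingLimitTest {
    // Hypothetical kernel mixing a 1-byte and an 8-byte element type in one loop.
    static void kernel(byte[] src, long[] dst, int limit) {
        for (int i = 0; i < limit; i++) {
            dst[i] = src[i] + 1L;
        }
    }

    // Runs the kernel for every limit in [0, maxLimit] and returns how many
    // limits produced a wrong result (0 means every limit was handled exactly).
    static int check(int maxLimit) {
        final long sentinel = Long.MIN_VALUE; // never produced by src[i] + 1L
        byte[] src = new byte[maxLimit];
        for (int i = 0; i < maxLimit; i++) {
            src[i] = (byte) i;
        }
        int failures = 0;
        for (int limit = 0; limit <= maxLimit; limit++) {
            long[] dst = new long[maxLimit];
            java.util.Arrays.fill(dst, sentinel);
            kernel(src, dst, limit);
            boolean ok = true;
            for (int i = 0; i < maxLimit; i++) {
                // Below the limit the element must be written, at or above it
                // the sentinel must be untouched.
                long expected = (i < limit) ? src[i] + 1L : sentinel;
                if (dst[i] != expected) {
                    ok = false;
                }
            }
            if (!ok) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failing limits: " + check(200));
    }
}
```

To actually exercise the masked post loop, the kernel would have to be compiled by C2 with the experimental flag enabled. The sketch only shows the checking logic.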

src/hotspot/share/opto/vmaskloop.cpp line 595:

> 593:   uint tree_depth = exact_log2(large) - exact_log2(small) + 1;
> 594:   // All vector masks construct a perfect binary tree of "2 ^ depth - 1" nodes
> 595:   // We create a list of "2 ^ depth" nodes for easier computation.

Assume we have a small and a large type (byte and long), with sizes 1 and 8 bytes. Then `tree_depth = log2(8) - log2(1) + 1 = 3 - 0 + 1 = 4`, so you generate a tree with `2^4 - 1 = 15` nodes. Did I calculate this right? That seems a bit excessive. It would be interesting to see benchmarks for mixed-type cases.
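To double-check the arithmetic for other type pairs, here is a small sketch that only reproduces the quoted expression and the node count from the comment. The class and method names are hypothetical, and it says nothing about how `vmaskloop.cpp` actually allocates the nodes. For powers of two, `Integer.numberOfTrailingZeros` plays the role of `exact_log2`.

```java
public class MaskTreeSize {
    // Mirrors "exact_log2(large) - exact_log2(small) + 1" for power-of-two sizes.
    static int treeDepth(int largeBytes, int smallBytes) {
        return Integer.numberOfTrailingZeros(largeBytes)
             - Integer.numberOfTrailingZeros(smallBytes) + 1;
    }

    // A perfect binary tree of the given depth has 2^depth - 1 nodes.
    static int treeNodes(int depth) {
        return (1 << depth) - 1;
    }

    public static void main(String[] args) {
        int[][] pairs = { {8, 1}, {8, 2}, {8, 4}, {8, 8}, {4, 1} };
        for (int[] p : pairs) {
            int depth = treeDepth(p[0], p[1]);
            System.out.println("large=" + p[0] + " small=" + p[1]
                + " depth=" + depth + " nodes=" + treeNodes(depth));
        }
    }
}
```

Under this reading, a same-size pair such as double and long needs a single mask node, while byte and long is the worst case among the primitive types.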

src/hotspot/share/opto/vmaskloop.cpp line 735:

> 733:           vnode = new StoreVectorMaskedNode(ctrl, mem, addr, val, at, mask);
> 734:         }
> 735:       } else if (VectorNode::is_convert_opcode(opc)) {

OK, this does work for same-size conversions:
`./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 -XX:+TraceSuperWord Test.java`

public class Test {
    static int RANGE = 1024;

    public static void main(String[] strArr) {
        double a[] = new double[RANGE];
        long b[] = new long[RANGE];
        test0(a, b);
    }

    static void test0(double[] a, long[] b) {
        for (int i = 0; i < RANGE; i++) {
            b[i] = (long)a[i];
        }
    }
}

Good to see that some conversion is possible. But if I replace `double` with `float`, I get `Vector element size does not match`. Can that limitation be lifted?
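For reference, the float variant looks roughly like this. The class name `TestFloat`, the input values, and the `mismatches` helper are additions for illustration. The loop itself is the same as above with `double` replaced by `float`, so the conversion is from a 4-byte to an 8-byte element.

```java
public class TestFloat {
    static int RANGE = 1024;

    public static void main(String[] strArr) {
        float[] a = new float[RANGE];
        long[] b = new long[RANGE];
        for (int i = 0; i < RANGE; i++) {
            a[i] = i * 0.5f; // arbitrary input values, chosen for illustration
        }
        test0(a, b);
        System.out.println("mismatches: " + mismatches(a, b));
    }

    // Same loop as in Test.test0, but converting float (4 bytes) to long (8 bytes).
    static void test0(float[] a, long[] b) {
        for (int i = 0; i < RANGE; i++) {
            b[i] = (long) a[i];
        }
    }

    // Hypothetical helper: counts elements that differ from the scalar conversion.
    static int mismatches(float[] a, long[] b) {
        int count = 0;
        for (int i = 0; i < RANGE; i++) {
            if (b[i] != (long) a[i]) {
                count++;
            }
        }
        return count;
    }
}
```

It can be run with the same flags as above, with the `compileonly` pattern adjusted to the new class name.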

src/hotspot/share/opto/vmaskloop.cpp line 785:

> 783: }
> 784: 
> 785: // Duplicate vectorized operations with given vector element size

Got to here today. There should probably be a comment higher up explaining that you first replace each scalar with one vector, and then duplicate the vectors for the larger types that need more than one.

I'm also concerned that there may be some platforms where the max vector width in bytes is not the same for all types. But maybe all platforms that support masked register ops also have the same vector width in bytes for all types?

-------------

PR Review: https://git.openjdk.org/jdk/pull/14581#pullrequestreview-1501451796
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244088279
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244114831
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244126073