RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Tue Jun 27 17:52:20 UTC 2023

On Tue, 27 Jun 2023 17:16:11 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> src/hotspot/share/opto/vmaskloop.cpp line 595:
>> 
>>> 593:   uint tree_depth = exact_log2(large) - exact_log2(small) + 1;
>>> 594:   // All vector masks construct a perfect binary tree of "2 ^ depth - 1" nodes
>>> 595:   // We create a list of "2 ^ depth" nodes for easier computation.
>> 
>> Assume we have a small and a large type (byte and long). Size 1 and 8. `tree_depth = log2(8) - log2(1) + 1 = 3 - 0 + 1 = 4`. Then you generate a tree with `2^4-1 = 15` nodes. Did I calculate this right? That seems a bit excessive. Would be interesting to see benchmarks for mixed type cases.
>
> Can there be cases where creating the masks makes vectorization unprofitable?

I have an example here:

public class Test {
    static int RANGE = 1024;

    public static void main(String[] strArr) {
        byte a[] = new byte[RANGE];
        long b[] = new long[RANGE];
        test0(a, b);
    }

    static void test0(byte[] a, long[] b) {
        for (int i = 0; i < RANGE; i++) {
            a[i]++;
            b[i]++;
        }
    }
}

`./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 Test.java`
These are the masks:

Generated vector masks in vmask tree
Lane_size = 1
 3710  LoopVectorMask  === _ 367 26  [[ 3711 3712 ]]  #vectormask[64]:{byte}
Lane_size = 2
 3711  ExtractLowMask  === _ 3710  [[ 3713 3714 ]]  #vectormask[32]:{short}
 3712  ExtractHighMask  === _ 3710  [[ 3715 3716 ]]  #vectormask[32]:{short}
Lane_size = 4
 3713  ExtractLowMask  === _ 3711  [[ 3717 3718 ]]  #vectormask[16]:{int}
 3714  ExtractHighMask  === _ 3711  [[ 3719 3720 ]]  #vectormask[16]:{int}
 3715  ExtractLowMask  === _ 3712  [[ 3721 3722 ]]  #vectormask[16]:{int}
 3716  ExtractHighMask  === _ 3712  [[ 3723 3724 ]]  #vectormask[16]:{int}
Lane_size = 8
 3717  ExtractLowMask  === _ 3713  [[ ]]  #vectormask[8]:{long}
 3718  ExtractHighMask  === _ 3713  [[ ]]  #vectormask[8]:{long}
 3719  ExtractLowMask  === _ 3714  [[ ]]  #vectormask[8]:{long}
 3720  ExtractHighMask  === _ 3714  [[ ]]  #vectormask[8]:{long}
 3721  ExtractLowMask  === _ 3715  [[ ]]  #vectormask[8]:{long}
 3722  ExtractHighMask  === _ 3715  [[ ]]  #vectormask[8]:{long}
 3723  ExtractLowMask  === _ 3716  [[ ]]  #vectormask[8]:{long}
 3724  ExtractHighMask  === _ 3716  [[ ]]  #vectormask[8]:{long}

That is indeed `15` masks: 1 + 2 + 4 + 8 across the four lane sizes. Hmm. Maybe that is the best one can do, and maybe it is not all that bad. But again, it would be interesting to see benchmarks for that case.
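
To double-check the arithmetic for other type combinations, here is a minimal standalone sketch. It is not the code in `vmaskloop.cpp`; the class and method names (`MaskTreeSize`, `treeDepth`, `maskCount`) are made up for illustration, and `Integer.numberOfTrailingZeros` stands in for `exact_log2`, which is only equivalent for powers of two.

```java
// Hypothetical helper, not part of HotSpot: reproduces the mask-count
// arithmetic discussed above for element sizes that are powers of two.
class MaskTreeSize {
    // Mirrors "exact_log2(large) - exact_log2(small) + 1" from the patch.
    // Assumes both sizes are powers of two (1, 2, 4, 8 bytes).
    static int treeDepth(int smallBytes, int largeBytes) {
        return Integer.numberOfTrailingZeros(largeBytes)
             - Integer.numberOfTrailingZeros(smallBytes) + 1;
    }

    // A perfect binary tree of the given depth has 2^depth - 1 nodes.
    static int maskCount(int depth) {
        return (1 << depth) - 1;
    }

    public static void main(String[] args) {
        int depth = treeDepth(1, 8); // byte and long
        System.out.println("depth = " + depth + ", masks = " + maskCount(depth));
        // Each level doubles the lane size and the number of masks.
        for (int level = 0, laneSize = 1; level < depth; level++, laneSize *= 2) {
            System.out.println("Lane_size = " + laneSize + ": " + (1 << level) + " mask(s)");
        }
    }
}
```

For byte and long this gives depth 4 and 15 masks, matching the trace. For int and long it would give depth 2 and 3 masks, so the cost grows quickly only when the size gap is large.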

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244104209