RFR: 8308994: C2: Re-implement experimental post loop vectorization [v2]

Tue Jul 4 02:17:22 UTC 2023

On Wed, 28 Jun 2023 10:24:58 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> I have an example here:
>> 
>> public class Test {
>>     static int RANGE = 1024;
>> 
>>     public static void main(String[] strArr) {
>>         byte a[] = new byte[RANGE];
>>         long b[] = new long[RANGE];
>>         test0(a, b);
>>     }
>> 
>>     static void test0(byte[] a, long[] b) {
>>         for (int i = 0; i < RANGE; i++) {
>>             a[i]++;
>>             b[i]++;
>>         }
>>     }
>> }
>> 
>> `./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 Test.java`
>> These are the masks:
>> 
>> Generated vector masks in vmask tree
>> Lane_size = 1
>>  3710  LoopVectorMask  === _ 367 26  [[ 3711 3712 ]]  #vectormask[64]:{byte}
>> Lane_size = 2
>>  3711  ExtractLowMask  === _ 3710  [[ 3713 3714 ]]  #vectormask[32]:{short}
>>  3712  ExtractHighMask  === _ 3710  [[ 3715 3716 ]]  #vectormask[32]:{short}
>> Lane_size = 4
>>  3713  ExtractLowMask  === _ 3711  [[ 3717 3718 ]]  #vectormask[16]:{int}
>>  3714  ExtractHighMask  === _ 3711  [[ 3719 3720 ]]  #vectormask[16]:{int}
>>  3715  ExtractLowMask  === _ 3712  [[ 3721 3722 ]]  #vectormask[16]:{int}
>>  3716  ExtractHighMask  === _ 3712  [[ 3723 3724 ]]  #vectormask[16]:{int}
>> Lane_size = 8
>>  3717  ExtractLowMask  === _ 3713  [[ ]]  #vectormask[8]:{long}
>>  3718  ExtractHighMask  === _ 3713  [[ ]]  #vectormask[8]:{long}
>>  3719  ExtractLowMask  === _ 3714  [[ ]]  #vectormask[8]:{long}
>>  3720  ExtractHighMask  === _ 3714  [[ ]]  #vectormask[8]:{long}
>>  3721  ExtractLowMask  === _ 3715  [[ ]]  #vectormask[8]:{long}
>>  3722  ExtractHighMask  === _ 3715  [[ ]]  #vectormask[8]:{long}
>>  3723  ExtractLowMask  === _ 3716  [[ ]]  #vectormask[8]:{long}
>>  3724  ExtractHighMask  === _ 3716  [[ ]]  #vectormask[8]:{long}
>> 
>> That is indeed `15` masks. Hmm. Maybe that is the best one can do. And maybe it is not all that bad. But again, it would be interesting to see the benchmarks for that case.
>
> Aha, maybe here we could just get away with 1 vmask for `byte`, and then directly extract 8 vmasks for `long`, since we do not need the ones in the middle? You'd have to generalize your `Extract(High/Low)Mask`.

We just benchmarked this "byte + long" case and saw performance regressions after vectorization, so yes, that many mask operations are expensive. GCC handles this better: for adjacent data sizes (larger = 2 * smaller) it extracts the two halves of the vector mask, but for non-adjacent data sizes it regenerates the vector masks instead of extracting them.
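
To make the difference in mask count concrete, here is a rough plain-Java model of the two strategies. It is only a sketch: it is neither C2 code nor GCC's actual implementation, the class and method names (`MaskSketch`, `generateMask`, `byExtraction`, `byRegeneration`) are made up for illustration, and it assumes a mask is simply "lane is active iff index + lane < limit". It counts mask values only and says nothing about the real cost of each operation.

```java
import java.util.Arrays;

public class MaskSketch {
    // Hypothetical model of mask generation: lane j is active iff index + j < limit.
    static boolean[] generateMask(int index, int limit, int lanes) {
        boolean[] mask = new boolean[lanes];
        for (int lane = 0; lane < lanes; lane++) {
            mask[lane] = index + lane < limit;
        }
        return mask;
    }

    // Extraction tree: generate one byte mask, then split every mask into its
    // low and high halves until each mask covers only longLanes lanes.
    static boolean[][] byExtraction(int index, int limit, int byteLanes,
                                    int longLanes, int[] maskCount) {
        boolean[][] level = { generateMask(index, limit, byteLanes) };
        maskCount[0] = 1; // the root byte mask
        while (level[0].length > longLanes) {
            boolean[][] next = new boolean[level.length * 2][];
            for (int k = 0; k < level.length; k++) {
                int half = level[k].length / 2;
                next[2 * k] = Arrays.copyOfRange(level[k], 0, half);                   // low half
                next[2 * k + 1] = Arrays.copyOfRange(level[k], half, level[k].length); // high half
            }
            maskCount[0] += next.length;
            level = next;
        }
        return level;
    }

    // Regeneration: keep the byte mask for the byte operations, and generate
    // each long mask directly from the loop index, skipping the middle levels.
    static boolean[][] byRegeneration(int index, int limit, int byteLanes,
                                      int longLanes, int[] maskCount) {
        generateMask(index, limit, byteLanes); // still needed for the byte vector
        maskCount[0] = 1;
        int vectors = byteLanes / longLanes;
        boolean[][] masks = new boolean[vectors][];
        for (int v = 0; v < vectors; v++) {
            masks[v] = generateMask(index + v * longLanes, limit, longLanes);
        }
        maskCount[0] += vectors;
        return masks;
    }

    public static void main(String[] args) {
        int[] extracted = new int[1];
        int[] regenerated = new int[1];
        // 64 byte lanes and 8 long lanes per vector, as in the trace above.
        boolean[][] a = byExtraction(1000, 1024, 64, 8, extracted);
        boolean[][] b = byRegeneration(1000, 1024, 64, 8, regenerated);
        System.out.println("same long masks: " + Arrays.deepEquals(a, b));
        System.out.println("masks via extraction: " + extracted[0]);
        System.out.println("masks via regeneration: " + regenerated[0]);
    }
}
```

With 64 byte lanes and 8 long lanes, the extraction tree in this model produces 1 + 2 + 4 + 8 = 15 masks, matching the trace, while direct regeneration produces 1 + 8 = 9 and yields the same long masks.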

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1251401597