RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Emanuel Peter epeter at openjdk.org
Wed Jun 28 11:09:23 UTC 2023


On Tue, 27 Jun 2023 17:25:11 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Can there be cases where creating the masks makes vectorization unprofitable?
>
> I have an example here:
> 
> public class Test {
>     static int RANGE = 1024;
> 
>     public static void main(String[] strArr) {
>         byte a[] = new byte[RANGE];
>         long b[] = new long[RANGE];
>         test0(a, b);
>     }
> 
>     static void test0(byte[] a, long[] b) {
>         for (int i = 0; i < RANGE; i++) {
>             a[i]++;
>             b[i]++;
>         }
>     }
> }
> 
> `./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 Test.java`
> These are the masks:
> 
> Generated vector masks in vmask tree
> Lane_size = 1
>  3710  LoopVectorMask  === _ 367 26  [[ 3711 3712 ]]  #vectormask[64]:{byte}
> Lane_size = 2
>  3711  ExtractLowMask  === _ 3710  [[ 3713 3714 ]]  #vectormask[32]:{short}
>  3712  ExtractHighMask  === _ 3710  [[ 3715 3716 ]]  #vectormask[32]:{short}
> Lane_size = 4
>  3713  ExtractLowMask  === _ 3711  [[ 3717 3718 ]]  #vectormask[16]:{int}
>  3714  ExtractHighMask  === _ 3711  [[ 3719 3720 ]]  #vectormask[16]:{int}
>  3715  ExtractLowMask  === _ 3712  [[ 3721 3722 ]]  #vectormask[16]:{int}
>  3716  ExtractHighMask  === _ 3712  [[ 3723 3724 ]]  #vectormask[16]:{int}
> Lane_size = 8
>  3717  ExtractLowMask  === _ 3713  [[ ]]  #vectormask[8]:{long}
>  3718  ExtractHighMask  === _ 3713  [[ ]]  #vectormask[8]:{long}
>  3719  ExtractLowMask  === _ 3714  [[ ]]  #vectormask[8]:{long}
>  3720  ExtractHighMask  === _ 3714  [[ ]]  #vectormask[8]:{long}
>  3721  ExtractLowMask  === _ 3715  [[ ]]  #vectormask[8]:{long}
>  3722  ExtractHighMask  === _ 3715  [[ ]]  #vectormask[8]:{long}
>  3723  ExtractLowMask  === _ 3716  [[ ]]  #vectormask[8]:{long}
>  3724  ExtractHighMask  === _ 3716  [[ ]]  #vectormask[8]:{long}
> 
> That is indeed `15` masks. Hmm. Maybe that is the best one can do. And maybe it is not all that bad. But again, would be interesting to see the benchmarks for that case.

Aha, maybe here we could just get away with 1 vmask for `byte`, and then directly extract 8 vmasks for `long`, since we do not need the ones in the middle? You'd have to generalize your `Extract(High/Low)Mask`.
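
To make the counting concrete, here is a rough sketch of the arithmetic only. It is not C2 code: the class and method names are made up for illustration, and it assumes lane sizes double at each level, as in the trace above. It compares a full `Extract(Low/High)Mask` tree against materializing only the root mask and the masks for the widest lane size.

```java
// Hypothetical counting model, not HotSpot code.
class MaskCount {
    // Masks in a full binary tree: one level per lane size, doubling each level.
    static int fullTreeMasks(int minLaneSize, int maxLaneSize) {
        int total = 0;
        int masksAtLevel = 1;
        for (int lane = minLaneSize; lane <= maxLaneSize; lane *= 2) {
            total += masksAtLevel;
            masksAtLevel *= 2;
        }
        return total;
    }

    // Masks if only the root and the widest-lane masks are created
    // (assumes a generalized extract that can skip intermediate levels).
    static int directExtractMasks(int minLaneSize, int maxLaneSize) {
        if (minLaneSize == maxLaneSize) {
            return 1; // the root mask is already the only one needed
        }
        return 1 + maxLaneSize / minLaneSize;
    }

    public static void main(String[] args) {
        System.out.println(fullTreeMasks(1, 8));      // byte..long, full tree
        System.out.println(directExtractMasks(1, 8)); // byte root + long leaves
    }
}
```

For the `byte`/`long` example this gives 15 masks for the full tree and 9 when the `short` and `int` levels are skipped. Whether that saving matters in practice would still have to be shown by benchmarks.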

>> ![image](https://github.com/openjdk/jdk/assets/32593061/a00e4973-2faf-428e-9794-48abb945e815)
>> 
>> That indeed looks like a mixup in the int/float memory slices. Not sure if there are any bad consequences, but that should be fixed.
>
> I just added some shorts, so that the int and float would be duplicated ;)

Suggested solution: track the last memory state per slice, just like I recently did in `SuperWord::schedule_reorder_memops` with `current_state_in_slice`.
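
A minimal sketch of that idea, written as a toy Java model rather than actual C2 code. `MemOp`, `wire`, and the string-based slice names are hypothetical. The assumption is that each memory operation takes the current state of its own slice as its memory input, and that only stores advance that state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: slices and nodes are plain strings, not C2 nodes.
class SliceStateSketch {
    static final class MemOp {
        final String name;
        final String slice;
        final boolean isStore;
        MemOp(String name, String slice, boolean isStore) {
            this.name = name;
            this.slice = slice;
            this.isStore = isStore;
        }
    }

    // Returns, for each op in schedule order, the memory state it is wired to.
    static Map<String, String> wire(List<MemOp> schedule) {
        Map<String, String> currentStateInSlice = new HashMap<>();
        Map<String, String> memInput = new LinkedHashMap<>();
        for (MemOp op : schedule) {
            // Start of a slice is modeled as its loop phi.
            String state = currentStateInSlice.getOrDefault(op.slice, "phi(" + op.slice + ")");
            memInput.put(op.name, state);
            if (op.isStore) {
                // Only stores produce a new memory state for their slice.
                currentStateInSlice.put(op.slice, op.name);
            }
        }
        return memInput;
    }

    public static void main(String[] args) {
        List<MemOp> schedule = new ArrayList<>();
        schedule.add(new MemOp("storeInt", "int", true));
        schedule.add(new MemOp("storeFloat", "float", true));
        schedule.add(new MemOp("loadInt", "int", false));
        System.out.println(wire(schedule));
    }
}
```

Because the state is keyed by slice, an `int` operation can never pick up a `float` state, which is the kind of mixup visible in the image above.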

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245012558
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1245004049


More information about the hotspot-dev mailing list