RFR: 8308994: C2: Re-implement experimental post loop vectorization [v2]
Pengfei Li
pli at openjdk.org
Tue Jul 4 02:17:22 UTC 2023
On Wed, 28 Jun 2023 10:24:58 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> I have an example here:
>>
>> public class Test {
>>     static int RANGE = 1024;
>>
>>     public static void main(String[] strArr) {
>>         byte a[] = new byte[RANGE];
>>         long b[] = new long[RANGE];
>>         test0(a, b);
>>     }
>>
>>     static void test0(byte[] a, long[] b) {
>>         for (int i = 0; i < RANGE; i++) {
>>             a[i]++;
>>             b[i]++;
>>         }
>>     }
>> }
>>
>> `./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 Test.java`
>> These are the masks:
>>
>> Generated vector masks in vmask tree
>> Lane_size = 1
>> 3710 LoopVectorMask === _ 367 26 [[ 3711 3712 ]] #vectormask[64]:{byte}
>> Lane_size = 2
>> 3711 ExtractLowMask === _ 3710 [[ 3713 3714 ]] #vectormask[32]:{short}
>> 3712 ExtractHighMask === _ 3710 [[ 3715 3716 ]] #vectormask[32]:{short}
>> Lane_size = 4
>> 3713 ExtractLowMask === _ 3711 [[ 3717 3718 ]] #vectormask[16]:{int}
>> 3714 ExtractHighMask === _ 3711 [[ 3719 3720 ]] #vectormask[16]:{int}
>> 3715 ExtractLowMask === _ 3712 [[ 3721 3722 ]] #vectormask[16]:{int}
>> 3716 ExtractHighMask === _ 3712 [[ 3723 3724 ]] #vectormask[16]:{int}
>> Lane_size = 8
>> 3717 ExtractLowMask === _ 3713 [[ ]] #vectormask[8]:{long}
>> 3718 ExtractHighMask === _ 3713 [[ ]] #vectormask[8]:{long}
>> 3719 ExtractLowMask === _ 3714 [[ ]] #vectormask[8]:{long}
>> 3720 ExtractHighMask === _ 3714 [[ ]] #vectormask[8]:{long}
>> 3721 ExtractLowMask === _ 3715 [[ ]] #vectormask[8]:{long}
>> 3722 ExtractHighMask === _ 3715 [[ ]] #vectormask[8]:{long}
>> 3723 ExtractLowMask === _ 3716 [[ ]] #vectormask[8]:{long}
>> 3724 ExtractHighMask === _ 3716 [[ ]] #vectormask[8]:{long}
>>
>> That is indeed `15` masks. Hmm. Maybe that is the best one can do. And maybe it is not all that bad. But again, would be interesting to see the benchmarks for that case.
>
> Aha, maybe here we could just get away with 1 vmask for `byte`, and then directly extract 8 vmasks for `long`, since we do not need the ones in the middle? You'd have to generalize your `Extract(High/Low)Mask`.
We just benchmarked this "byte + long" case and saw some performance regressions after vectorization. Yes, too many mask operations are expensive. GCC handles this better: for adjacent data sizes (larger = 2 * smaller) it extracts the two halves of the vector mask, but for non-adjacent data sizes it regenerates the vector masks instead of extracting them.
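
To make the mask counts concrete, here is a minimal counting sketch. It is not C2 or GCC code: the class and method names are hypothetical, and it assumes one mask per vector at each lane size and ignores the cost of each individual mask operation. It only compares how many masks exist when every intermediate lane size gets low/high extractions versus when only the lane sizes actually used in the loop get masks.

```java
public class MaskCountSketch {
    // Hypothetical model: start with one mask at the smallest lane size and
    // split every mask into a low and a high half each time the lane size
    // doubles, up to the largest lane size.
    static int masksWithHalving(int smallestLaneSize, int largestLaneSize) {
        int total = 0;
        int masksAtLevel = 1;
        for (int lane = smallestLaneSize; lane <= largestLaneSize; lane *= 2) {
            total += masksAtLevel;
            masksAtLevel *= 2;
        }
        return total;
    }

    // Hypothetical model: only lane sizes that the loop really uses get masks.
    // A lane size that is k times the smallest one needs k masks to cover the
    // same number of elements as one mask of the smallest lane size.
    static int masksForUsedSizesOnly(int[] usedLaneSizes) {
        int smallest = Integer.MAX_VALUE;
        for (int lane : usedLaneSizes) {
            smallest = Math.min(smallest, lane);
        }
        int total = 0;
        for (int lane : usedLaneSizes) {
            total += lane / smallest;
        }
        return total;
    }

    public static void main(String[] args) {
        // byte (1) + long (8), as in the test0 example
        System.out.println(masksWithHalving(1, 8));                  // 15
        System.out.println(masksForUsedSizesOnly(new int[]{1, 8})); // 9
    }
}
```

Under these assumptions the "byte + long" loop needs 9 masks (1 for `byte`, 8 for `long`) instead of 15, because the `short` and `int` levels are skipped. Whether those 8 `long` masks are better extracted directly or regenerated is a separate cost question that this sketch does not model.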
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1251401597