RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization
Emanuel Peter
epeter at openjdk.org
Tue Jun 27 17:52:20 UTC 2023
On Tue, 27 Jun 2023 17:16:11 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> src/hotspot/share/opto/vmaskloop.cpp line 595:
>>
>>> 593: uint tree_depth = exact_log2(large) - exact_log2(small) + 1;
>>> 594: // All vector masks construct a perfect binary tree of "2 ^ depth - 1" nodes
>>> 595: // We create a list of "2 ^ depth" nodes for easier computation.
>>
>> Assume we have a small and a large type (byte and long). Size 1 and 8. `tree_depth = log2(8) - log2(1) + 1 = 3 - 0 + 1 = 4`. Then you generate a tree with `2^4-1 = 15` nodes. Did I calculate this right? That seems a bit excessive. Would be interesting to see benchmarks for mixed type cases.
>
> Can there be cases where creating the masks makes vectorization unprofitable?
I have an example here:
public class Test {
    static int RANGE = 1024;

    public static void main(String[] strArr) {
        byte a[] = new byte[RANGE];
        long b[] = new long[RANGE];
        test0(a, b);
    }

    static void test0(byte[] a, long[] b) {
        for (int i = 0; i < RANGE; i++) {
            a[i]++;
            b[i]++;
        }
    }
}
`./java -Xcomp -XX:-TieredCompilation -XX:+TraceNewVectors -XX:+TraceLoopOpts -XX:+UnlockExperimentalVMOptions -XX:+UseMaskedLoop -XX:+TraceMaskedLoop -XX:CompileCommand=compileonly,Test::test0 Test.java`
These are the masks:
Generated vector masks in vmask tree
Lane_size = 1
3710 LoopVectorMask === _ 367 26 [[ 3711 3712 ]] #vectormask[64]:{byte}
Lane_size = 2
3711 ExtractLowMask === _ 3710 [[ 3713 3714 ]] #vectormask[32]:{short}
3712 ExtractHighMask === _ 3710 [[ 3715 3716 ]] #vectormask[32]:{short}
Lane_size = 4
3713 ExtractLowMask === _ 3711 [[ 3717 3718 ]] #vectormask[16]:{int}
3714 ExtractHighMask === _ 3711 [[ 3719 3720 ]] #vectormask[16]:{int}
3715 ExtractLowMask === _ 3712 [[ 3721 3722 ]] #vectormask[16]:{int}
3716 ExtractHighMask === _ 3712 [[ 3723 3724 ]] #vectormask[16]:{int}
Lane_size = 8
3717 ExtractLowMask === _ 3713 [[ ]] #vectormask[8]:{long}
3718 ExtractHighMask === _ 3713 [[ ]] #vectormask[8]:{long}
3719 ExtractLowMask === _ 3714 [[ ]] #vectormask[8]:{long}
3720 ExtractHighMask === _ 3714 [[ ]] #vectormask[8]:{long}
3721 ExtractLowMask === _ 3715 [[ ]] #vectormask[8]:{long}
3722 ExtractHighMask === _ 3715 [[ ]] #vectormask[8]:{long}
3723 ExtractLowMask === _ 3716 [[ ]] #vectormask[8]:{long}
3724 ExtractHighMask === _ 3716 [[ ]] #vectormask[8]:{long}
That is indeed `15` masks. Hmm. Maybe that is the best one can do, and maybe it is not all that bad. But again, it would be interesting to see benchmarks for that case.
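To make the count easy to re-check for other type combinations, here is a minimal sketch that only redoes the arithmetic from the quoted comment. It is not the code from `vmaskloop.cpp`: the class `MaskTreeSize` and its methods are hypothetical names made up for illustration, and `exactLog2` is just a stand-in for HotSpot's `exact_log2`. It assumes both element sizes are powers of two and that every level of the tree is fully populated, as in the trace above.

```java
public class MaskTreeSize {
    // Stand-in for HotSpot's exact_log2: log2 of a power of two (assumption: x is a power of two).
    static int exactLog2(int x) {
        if (x <= 0 || (x & (x - 1)) != 0) {
            throw new IllegalArgumentException("not a power of two: " + x);
        }
        return Integer.numberOfTrailingZeros(x);
    }

    // tree_depth = log2(large) - log2(small) + 1, as in the quoted source line.
    static int treeDepth(int small, int large) {
        return exactLog2(large) - exactLog2(small) + 1;
    }

    // A perfect binary tree of the given depth has 2^depth - 1 nodes.
    static int maskCount(int small, int large) {
        return (1 << treeDepth(small, large)) - 1;
    }

    // Each level doubles the number of masks, so a lane size has laneSize / small masks.
    static int masksAtLaneSize(int small, int laneSize) {
        return laneSize / small;
    }

    public static void main(String[] args) {
        int small = 1; // byte
        int large = 8; // long
        for (int lane = small; lane <= large; lane *= 2) {
            System.out.println("Lane_size = " + lane + ": "
                    + masksAtLaneSize(small, lane) + " mask(s)");
        }
        System.out.println("tree_depth = " + treeDepth(small, large));
        System.out.println("total masks = " + maskCount(small, large));
    }
}
```

For byte and long this gives 1 + 2 + 4 + 8 masks, which matches the trace. For a narrower mix such as int and long (sizes 4 and 8) the same formula gives a depth of 2 and only 3 masks, so the byte/long case is the worst one for this scheme.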
-------------
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244104209