Re:[Vector]fromArray/allTrue performance

王卓(卓仁) zhuoren.wz at alibaba-inc.com
Wed Nov 21 09:17:18 UTC 2018


Hello Ivanov,
After further investigation, I found a potential bug that can cause a crash when using these APIs. I will describe the crash first and then my workaround for Long128Vector fromArray. All patches and Java files are inlined below.
The crash
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2ae4cfa5bf, pid=120899, tid=120938
#
# JRE version: OpenJDK Runtime Environment (12.0) (build 12-internal+0-adhoc.admin.dev)
# Java VM: OpenJDK 64-Bit Server VM (12-internal+0-adhoc.admin.dev, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0x6e65bf]  PhiNode::Value(PhaseGVN*) const+0xef
#
# Core dump will be written. Default location: /home/admin/zhuoren/dev/core.120899
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Basically, it is caused by the type of a PhiNode being used before it has been set.
To reproduce the crash, the JDK code must be modified. The change is pure Java, yet it crashes the C2 compiler thread. Sorry, I cannot reproduce this crash without changing the JDK, because getBits is not exposed.
Steps to reproduce the crash:
Step 1. Apply the patch below to the JDK. The patch itself is also a workaround for Long128Vector fromArray with masks.

diff -r a701d05cc2eb src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java
--- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java    Thu Nov 15 17:18:03 2018 -0800
+++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java    Wed Nov 21 16:54:56 2018 +0800
@@ -1323,7 +1323,22 @@
         @Override
         @ForceInline
         public Long128Vector fromArray(long[] a, int ax, Mask<Long, Shapes.S128Bit> m) {
-            return zero().blend(fromArray(a, ax), m);
+            boolean[] bits = ((Long128Mask)m).getBits();
+            if (bits[0] == false) {
+                if (bits[1] == false) {
+                    return zero();
+                } else {
+                    Long128Vector v = SPECIES.fromArray(a, ax);
+                    return v.with(0, 0);
+                }
+            } else {
+                Long128Vector v = SPECIES.fromArray(a, ax);
+                if (bits[1] == false) {
+                    return v.with(1, 0);
+                } else {
+                    return v;
+                }
+            }
         }

         @Override

Step 2. Build the new JDK.
Step 3. Run this demo, VectorMatrixUpdateMaskLong.java:

import jdk.incubator.vector.*;
import java.util.Arrays;
import java.util.Random;
import java.io.File;
import java.lang.reflect.Field;
import java.io.IOException;
import sun.misc.Unsafe;
import jdk.incubator.vector.Vector.Mask;
public class VectorMatrixUpdateMaskLong
{

    public static int size = 1024 * 16;
    static Random random = new Random();
    static final LongVector.LongSpecies<Shapes.S128Bit> species128 = LongVector.species(Shapes.S_128_BIT);
    static boolean[] AisNull0 = new boolean[size];
    static long[] result0 = new long[size];
    static long[] input0 = new long[size];
    public static void main(String[] args) throws  NoSuchFieldException, SecurityException, IllegalArgumentException, IllegalAccessException, InstantiationException {
        System.out.println("doing warmup");
        long start1 = System.currentTimeMillis();
        long normalTime = 0;
        long vecTime = 0;
        int i = 0;
        for (i = 0; i < size; i++) {
            result0[i] = 0;
            input0[i] = random.nextLong();
            if (random.nextInt(10) > 6) {
                AisNull0[i] = true;
            } else {
                AisNull0[i] = false;
            }
        }
        for (i = 0; i < 20000; i++) {
            vecTest(species128);
        }
        System.out.println("begin test");
        start1 = System.currentTimeMillis();
        for (i = 0; i < 10000; i++) {
            vecTest(species128);
        }
        vecTime = System.currentTimeMillis() - start1;
        System.out.println("vector time used:" + vecTime);
    }
    static <S extends Vector.Shape> void vecTest(LongVector.LongSpecies<S> longSpecies ) {
        LongVector<S> v;
        int i = 0;
        Mask mask0;
        for (i = 0; i + (longSpecies.length()) <= size; i += longSpecies.length()) {
            mask0 = longSpecies.maskFromArray(AisNull0, i);
            v = longSpecies.fromArray(input0, i, mask0);
            v.intoArray(result0, i);
        }
        return;
    }
}

The crash then occurs.
Is this crash a known issue? If not, and nobody is working on it yet, I would be glad to find the root cause and fix it.
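Because the Step 1 patch is pure Java, its lane semantics can be checked without C2. The sketch below is a plain-Java scalar model of a 2-lane long vector; it does not use jdk.incubator.vector, and the class and method names are hypothetical. It compares the branch structure of the patch, with v.with(lane, 0) modeled as zeroing one array element, against the original zero().blend(fromArray(a, ax), m) semantics for all four masks. If the two agree, that suggests the problem lies in how C2 compiles the patch rather than in the patch logic.

```java
import java.util.Arrays;

// Hypothetical scalar model (no jdk.incubator.vector) of a 2-lane long vector.
public class MaskedLoadModel {

    // Reference semantics of zero().blend(fromArray(a, ax), m):
    // lane i takes a[ax + i] where the mask is set, and 0 otherwise.
    static long[] blendReference(long[] a, int ax, boolean[] m) {
        long[] r = new long[2];
        for (int i = 0; i < 2; i++) {
            r[i] = m[i] ? a[ax + i] : 0L;
        }
        return r;
    }

    // Mirrors the branches of the Step 1 patch; with(lane, 0) is modeled
    // as copying the loaded lanes and zeroing that lane.
    static long[] branchWorkaround(long[] a, int ax, boolean[] bits) {
        long[] v = Arrays.copyOfRange(a, ax, ax + 2); // SPECIES.fromArray(a, ax)
        if (!bits[0]) {
            if (!bits[1]) {
                return new long[2];                   // zero()
            }
            v[0] = 0L;                                // v.with(0, 0)
            return v;
        }
        if (!bits[1]) {
            v[1] = 0L;                                // v.with(1, 0)
        }
        return v;
    }

    public static void main(String[] args) {
        long[] a = {11L, -22L, 33L, -44L};
        boolean[][] masks = {{false, false}, {false, true}, {true, false}, {true, true}};
        for (boolean[] m : masks) {
            long[] ref = blendReference(a, 2, m);
            long[] alt = branchWorkaround(a, 2, m);
            System.out.println(Arrays.toString(m) + " -> " + Arrays.toString(alt)
                    + (Arrays.equals(ref, alt) ? " (matches)" : " (MISMATCH)"));
        }
    }
}
```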


Another workaround for Long128Vector fromArray with masks
This patch fixes the performance issue: with it, VectorMatrixUpdateMaskLong.java runs 10 to 15 times faster. Of course, mask0 = longSpecies.maskFromArray(AisNull0, i); must also be moved out of the main loop to avoid the separate maskFromArray performance issue.
Please give advice on this workaround.

diff -r a701d05cc2eb src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java
--- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java    Thu Nov 15 17:18:03 2018 -0800
+++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java    Tue Nov 20 16:45:18 2018 +0800
@@ -42,6 +42,10 @@

     static final Long128Vector ZERO = new Long128Vector();

+    static final Long128Vector ZEROONE = allocZeroOne();
+
+    static final Long128Vector ONEZERO = allocOneZero();
+
     static final int LENGTH = SPECIES.length();

     private final long[] vec; // Don't access directly, use getElements() instead.
@@ -58,6 +62,20 @@
         vec = v;
     }

+    static private Long128Vector allocZeroOne() {
+        long[] zeroOneArray = new long[2];
+        zeroOneArray[0] = 0xffffffffffffffffl;
+        zeroOneArray[1] = 0;
+        return new Long128Vector(zeroOneArray);
+    }
+
+    static private Long128Vector allocOneZero() {
+        long[] oneZeroArray = new long[2];
+        oneZeroArray[0] = 0;
+        oneZeroArray[1] = 0xffffffffffffffffl;
+        return new Long128Vector(oneZeroArray);
+    }
+
     @Override
     public int length() { return LENGTH; }

@@ -1323,7 +1341,20 @@
         @Override
         @ForceInline
         public Long128Vector fromArray(long[] a, int ax, Mask<Long, Shapes.S128Bit> m) {
-            return zero().blend(fromArray(a, ax), m);
+            boolean[] bits = ((Long128Mask)m).getBits();
+            if (bits[0] == false) {
+                if (bits[1] == false) { // mask is 0 0
+                    return SPECIES.zero();
+                } else {                // mask is 0 1
+                    return SPECIES.fromArray(a, ax).and(ONEZERO);
+                }
+            } else {
+                if (bits[1] == true) {  // mask is 1 1
+                    return SPECIES.fromArray(a, ax);
+                } else {                // mask is 1 0
+                    return SPECIES.fromArray(a, ax).and(ZEROONE);
+                }
+            }
         }

         @Override

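The AND trick in the patch above relies on x & ~0 == x and x & 0 == 0 per lane. Note that ZEROONE holds all-ones in lane 0 (so it keeps lane 0) and ONEZERO keeps lane 1. A minimal scalar sketch under those assumptions, in plain Java without jdk.incubator.vector (AndMaskModel and its methods are hypothetical names), checks the four mask cases against the original blend semantics:

```java
import java.util.Arrays;

// Hypothetical scalar model (no jdk.incubator.vector) of the AND-based
// workaround for a 2-lane long vector. Lane i of the mask is bits[i].
public class AndMaskModel {
    // Same layout as the patch: ZEROONE keeps lane 0, ONEZERO keeps lane 1.
    static final long[] ZEROONE = {0xFFFFFFFFFFFFFFFFL, 0L};
    static final long[] ONEZERO = {0L, 0xFFFFFFFFFFFFFFFFL};

    // Lane-wise AND, standing in for Long128Vector.and(...).
    static long[] and(long[] v, long[] k) {
        return new long[] {v[0] & k[0], v[1] & k[1]};
    }

    // Mirrors the branch structure of the second patch.
    static long[] andWorkaround(long[] a, int ax, boolean[] bits) {
        long[] v = Arrays.copyOfRange(a, ax, ax + 2); // SPECIES.fromArray(a, ax)
        if (!bits[0]) {
            return bits[1] ? and(v, ONEZERO)          // mask is 0 1
                           : new long[2];             // mask is 0 0
        }
        return bits[1] ? v                            // mask is 1 1
                       : and(v, ZEROONE);             // mask is 1 0
    }

    // Reference semantics of zero().blend(fromArray(a, ax), m).
    static long[] blendReference(long[] a, int ax, boolean[] m) {
        long[] r = new long[2];
        for (int i = 0; i < 2; i++) {
            r[i] = m[i] ? a[ax + i] : 0L;
        }
        return r;
    }

    public static void main(String[] args) {
        long[] a = {-1L, Long.MIN_VALUE, 42L, -7L};
        boolean[][] masks = {{false, false}, {false, true}, {true, false}, {true, true}};
        for (int ax = 0; ax <= 2; ax += 2) {
            for (boolean[] m : masks) {
                boolean ok = Arrays.equals(blendReference(a, ax, m), andWorkaround(a, ax, m));
                System.out.println("ax=" + ax + " mask=" + Arrays.toString(m) + (ok ? " ok" : " MISMATCH"));
            }
        }
    }
}
```

Presumably the AND version is faster because it never materializes a vector mask: getBits() is read in plain Java and only the lane-wise and() has to be intrinsified, whereas the blend path has to load a Long128 mask, which the VectorLoadMask check in x86.ad rejects for vlen == 2.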
Regards,
Zhuoren
------------------------------------------------------------------
From: Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
Sent: Tuesday, November 20, 2018 09:39
To: 王卓(卓仁) <zhuoren.wz at alibaba-inc.com>; panama-dev <panama-dev at openjdk.java.net>
Subject: Re: [Vector]fromArray/allTrue performance

Zhuo, thanks for the feedback!

Unfortunately, it seems the attachment was stripped by the mail server.
Do you mind resending it inline?

I'll let Intel folks comment on individual cases you refer to, but 
overall everything marked as "Implementation limitation" is a 
work-in-progress and will be addressed later at some point.

Best regards,
Vladimir Ivanov

On 19/11/2018 02:41, 王卓(卓仁) wrote:
> Hello,
> I am Zhuoren from the Alibaba JVM team. Glad to take part in Project Panama.
> We have integrated the Vector API into Alibaba JDK, and another Alibaba team is now using the Vector API to optimize their applications.
> Here are some issues we found in our previous work that block further optimization; I am looking for solutions.
> 
> 1. The performance of Long128Species::fromArray(long[] a, int ax, Mask<Long, Shapes.S128Bit> m) is bad. The attached java file is a test for this API.
> I checked, and the performance issue is due to an intrinsic failure:
> x86.ad:
>          case Op_VectorLoadMask:
>            if (UseSSE <= 3) { ret_value = false; }
>            else if (vlen == 1 || vlen == 2) { ret_value = false; } // Implementation limitation
>            else if (size_in_bits >= 256 && UseAVX < 2) { ret_value = false; } // Implementation limitation
>            break;
> 
> I wonder if there will be a fix for this issue, because this API is very important to our optimizations. The fromArray.patch is the workaround we are using. Suggestions on how to improve it are also welcome.
> 
> 2. anyTrue/allTrue on 512 bit
> This is another intrinsic failure:
>          case Op_VectorTest:
>            if (UseAVX <= 0) { ret_value = false; }
>            else if (size_in_bits != 128 && size_in_bits != 256) { ret_value = false; } // Implementation limitation
>            break;
> This also blocks some of our 512-bit optimizations, and I have no workaround for it yet.
> 
> Please share with me your advice, thanks!
> 
> Regards,
> Zhuo
> 
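The two x86.ad checks quoted above can be modeled in plain Java to see which shapes fall back. This is a hypothetical sketch: the class, the method names and the flag values (UseSSE = 4, UseAVX = 3 for an AVX-512 machine) are assumptions, and vlen is assumed to be the lane count, so a Long128 vector has 128 / 64 = 2 lanes and hits the vlen == 2 limitation.

```java
// Hypothetical plain-Java model of the two x86.ad predicates quoted above.
// Parameter names mirror the .ad snippet; vlen is assumed to be the lane count.
public class X86VectorSupportModel {

    // Lane count for a vector of sizeInBits with elementBits-wide elements.
    static int lanes(int sizeInBits, int elementBits) {
        return sizeInBits / elementBits;
    }

    // case Op_VectorLoadMask
    static boolean vectorLoadMaskSupported(int useSSE, int useAVX, int vlen, int sizeInBits) {
        if (useSSE <= 3) return false;
        if (vlen == 1 || vlen == 2) return false;          // implementation limitation
        if (sizeInBits >= 256 && useAVX < 2) return false; // implementation limitation
        return true;
    }

    // case Op_VectorTest (behind anyTrue/allTrue)
    static boolean vectorTestSupported(int useAVX, int sizeInBits) {
        if (useAVX <= 0) return false;
        return sizeInBits == 128 || sizeInBits == 256;     // implementation limitation
    }

    public static void main(String[] args) {
        int useSSE = 4, useAVX = 3; // assumed flags for an AVX-512 machine
        System.out.println("Long128 mask load (2 lanes): "
                + vectorLoadMaskSupported(useSSE, useAVX, lanes(128, 64), 128));
        System.out.println("Int128 mask load (4 lanes):  "
                + vectorLoadMaskSupported(useSSE, useAVX, lanes(128, 32), 128));
        System.out.println("anyTrue/allTrue on 256 bits: " + vectorTestSupported(useAVX, 256));
        System.out.println("anyTrue/allTrue on 512 bits: " + vectorTestSupported(useAVX, 512));
    }
}
```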

