回复：Re：[Vector]fromArray/allTrue performance

Thu Nov 22 09:04:02 UTC 2018

Ivanov,

> Does the following patch fix the crash for you?
This patch does fix the crash. Please go ahead to push it.

> Proper way to fix the issue is to introduce missing support on C2 side.
> Feel free to take care of it. (If anybody is already on it, please, 
speak up!)
I will work on the missing support on C2 side to fix perf issue of Long128Vector.fromArray(), if nobody is on it.

Regards,
Zhuoren


------------------------------------------------------------------
发件人：Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
发送时间：2018年11月22日(星期四) 02:47
收件人：王卓(卓仁) <zhuoren.wz at alibaba-inc.com>; panama-dev <panama-dev at openjdk.java.net>
抄　送：周经森(英通) <kingsum.kc at alibaba-inc.com>; 李三红(三红) <sanhong.lsh at alibaba-inc.com>
主　题：Re: Re：[Vector]fromArray/allTrue performance

Zhuoren,

> Basically, t is caused by type of a PhiNode is used but not set.
> To reproduce the crash, we must modify JDK code. The change is pure java 
> but can cause crash in C2 compiler thread. Sorry that I cannot reproduce 
> this crash without changing JDK, because getBits is not opposed.

Thanks a lot for the test case and detailed instructions!

Does the following patch fix the crash for you?

diff --git a/src/hotspot/share/opto/library_call.cpp 
b/src/hotspot/share/opto/library_call.cpp
--- a/src/hotspot/share/opto/library_call.cpp
+++ b/src/hotspot/share/opto/library_call.cpp
@@ -8038,7 +8038,7 @@
      default: fatal("%s", type2name(elem_bt)); break;
    }

-  Node* operation = VectorInsertNode::make(opd, insert_val, 
idx->get_con());
+  Node* operation = _gvn.transform(VectorInsertNode::make(opd, 
insert_val, idx->get_con()));
    operation = box_vector(operation, vbox_type, elem_bt, num_elem);
    set_vector_result(operation);

I'll push it on your behalf if it fixes the bug.

I'm curious why unit tests don't catch it.

> *Another workaround*for for Long128VectorfromArray using masks
> This patch can fix the performance issue. With this patch, 
> VectorMatrixUpdateMaskLong.java is 10 - 15 times faster. Of cause 
> /mask0 = longSpecies.maskFromArray(AisNull0, i); /must be moved out of 
> main loop to get rid of maskFromArray performance issue.
> Please give advice on this workaround.

Proper way to fix the issue is to introduce missing support on C2 side.
Feel free to take care of it. (If anybody is already on it, please, 
speak up!)

Meanwhile you can extract Long128Vector.fromArray() into a helper method 
and use it instead.

Best regards,
Vladimir Ivanov


> 
> diff -r a701d05cc2eb src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java
> --- a/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java    Thu Nov 15 17:18:03 2018 -0800
> +++ b/src/jdk.incubator.vector/share/classes/jdk/incubator/vector/Long128Vector.java    Tue Nov 20 16:45:18 2018 +0800
> @@ -42,6 +42,10 @@
> 
>       static final Long128Vector ZERO = new Long128Vector();
> 
> +    static final Long128Vector ZEROONE = allocZeroOne();
> +
> +    static final Long128Vector ONEZERO = allocOneZero();
> +
>       static final int LENGTH = SPECIES.length();
> 
>       private final long[] vec; // Don't access directly, use getElements() instead.
> @@ -58,6 +62,20 @@
>           vec = v;
>       }
> 
> +    static private Long128Vector allocZeroOne() {
> +        long[] zeroOneArray = new long[2];
> +        zeroOneArray[0] = 0xffffffffffffffffl;
> +        zeroOneArray[1] = 0;
> +        return new Long128Vector(zeroOneArray);
> +    }
> +
> +    static private Long128Vector allocOneZero() {
> +        long[] oneZeroArray = new long[2];
> +        oneZeroArray[0] = 0;
> +        oneZeroArray[1] = 0xffffffffffffffffl;
> +        return new Long128Vector(oneZeroArray);
> +    }
> +
>       @Override
>       public int length() { return LENGTH; }
> 
> @@ -1323,7 +1341,20 @@
>           @Override
>           @ForceInline
>           public Long128Vector fromArray(long[] a, int ax, Mask<Long, Shapes.S128Bit> m) {
> -            return zero().blend(fromArray(a, ax), m);
> +            boolean[] bits = ((Long128Mask)m).getBits();
> +            if (bits[0] == false) {
> +                if (bits[1] == false) { // mask is 0 0
> +                    return SPECIES.zero();
> +                } else {                // mask is 0 1
> +                    return SPECIES.fromArray(a, ax).and(ONEZERO);
> +                }
> +            } else {
> +                if (bits[1] == true) {  // mask is 1 1
> +                    return SPECIES.fromArray(a, ax);
> +                } else {                // mask is 0 1
> +                    return SPECIES.fromArray(a, ax).and(ZEROONE);
> +                }
> +            }
>           }
> 
>           @Override
> 
> Regards,
> Zhuoren
> 
>     ------------------------------------------------------------------
>     发件人：Vladimir Ivanov <vladimir.x.ivanov at oracle.com>
>     发送时间：2018年11月20日(星期二) 09:39
>     收件人：王卓(卓仁) <zhuoren.wz at alibaba-inc.com>; panama-dev
>     <panama-dev at openjdk.java.net>
>     主 题：Re: [Vector]fromArray/allTrue performance
> 
>     Zhuo, thanks for the feedback!
> 
>     Unfortunately, it seems the attachment was stripped by mail server.
>     Do you mind resending it inline?
> 
>     I'll let Intel folks comment on individual cases you refer to, but
>     overall everything marked as "Implementation limitation" is a
>     work-in-progress and will be addressed later at some point.
> 
>     Best regards,
>     Vladimir Ivanov
> 
>     On 19/11/2018 02:41, 王卓(卓仁) wrote:
>      > Hello,
>      > I am Zhuoren from Alibaba JVM team. Glad to take part in project panama.
>      > We have integrated Vector API to Alibaba JDK, and another Alibaba team is now using Vector API to optimize their applications.
>      > Here are some issues we found in our previous work and blocked future optimization, and I am searching for solutions.
>      >
>      > 1. The performance of Long128Species::fromArray(long[] a, int ax, Mask<Long, Shapes.S128Bit> m) is bad. The attached java file is a test for this API.
>      > I checked the performance issue is due to intrinsic failure.
>      > x86.ad:
>      >          case Op_VectorLoadMask:
>      >            if (UseSSE <= 3) { ret_value = false; }
>      >            else if (vlen == 1 || vlen == 2) { ret_value = false; } // Implementation limitation
>      >            else if (size_in_bits >= 256 && UseAVX < 2) { ret_value = false; } // Implementation limitation
>      >            break;
>      >
>      > I wonder if there will be a fix for this issue, because this API is very important to our optimizations. The fromArray.patch is the workaround we are using. How to improve this workaround is also welcome.
>      >
>      > 2. anyTrue/allTrue on 512 bit
>      > This is another intrinsic failure:
>      >          case Op_VectorTest:
>      >            if (UseAVX <= 0) { ret_value = false; }
>      >            else if (size_in_bits != 128 && size_in_bits != 256) { ret_value = false; } // Implementation limitation
>      >            break;
>      > This also blocked some of our 512 bit optimizations, but I have no workaround for this now.
>      >
>      > Please share with me your advice, thanks!
>      >
>      > Regards,
>      > Zhuo
>      > 
>