RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed

Tue Sep 19 17:54:49 UTC 2017

On 9/18/17 10:59 PM, Zhongwei Yao wrote:
> Hi, Vladimir,
> 
> On 19 September 2017 at 00:17, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()?
> 
> Due to 2 reasons, I have not chosen existing passed_slp flag:

My point is that if we don't find vectors in a loop (as in your case) we 
should ignore whole SLP analysis.

In best case scenario SuperWord::unrolling_analysis() should determine 
if there are vectors candidates. For example, check if array's index is 
depend on loop's index variable.

An other way is to call SuperWord::unrolling_analysis() only after we 
did vector analysis.

It is more complicated changes and out of scope of this. There is also 
side effect I missed before which may prevent using set_notpassed_slp(): 
LoopMaxUnroll is changed based on SLP analysis before has_passed_slp() 
check.

Note, set_notpassed_slp() is also used to additional unroll already 
vectorized loops:

http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421

May be you should also call mark_do_unroll_only() when you set 
set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt 
problem you pointed. Can you look on this?

I am not against adding new is_slp_vec_failed() but I want first to 
investigate if we can re-use existing functions.

Thanks,
Vladimir

>    1. If we set_notpassed_slp() when _packset.length() == 0 in
> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
> checking:
> 
>     if (cl->has_passed_slp()) {
>       if (slp_max_unroll_factor >= future_unroll_ct) return true;
>       // Normal case: loop too big
>       return false;
>     }
> 
>     we will ignore the case: "cl->has_passed_slp() &&
> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()"
> as alos exposed in my patch:
> 
>     if (cl->has_passed_slp()) {
>       if (slp_max_unroll_factor >= future_unroll_ct) return true;
> -    // Normal case: loop too big
> -    return false;
> +    // When SLP vectorization failed, we could do more unrolling
> +    // optimizations if body size is less than limit size. Otherwise,
> +    // return false due to loop is too big.
> +    if (!cl->is_slp_vec_failed()) return false;
>     }
> 
>     However, I have not found a case to support this condition yet.
> 
>    2. As replied below, in:
>> -        } else if (cl->is_main_loop()) {
>> +        } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>             sw.transform_loop(lpt, true);
>        I need to check whether cl->is_slp_vec_failed() is true.Such
> checking becomes explicit when using SLPAutoVecFailed flag.
> 
>>
>> Why you need next additional check?:
>>
>> -        } else if (cl->is_main_loop()) {
>> +        } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>             sw.transform_loop(lpt, true);
>>
> 
> The additional check prevents the case that when
> cl->is_slp_vec_failed() is true, then SuperWord::output() will
> set_major_progress() at the beginning (because _packset.length() == 0
> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
> loop iteration" will not stop untill loop_opts_cnt reachs 0, which is
> not we want.

> 
>>
>> Thanks,
>> Vladimir
>>
>>
>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>
>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>
>>> Hi, all,
>>>
>>> Bug:
>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>
>>> Webrev:
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>
>>> In the current implementation, the loop unrolling times are determined
>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>> true (both X86 and aarch64 are true for now).
>>>
>>> This unrolling policy generates less optimized code when SLP
>>> auto-vectorization fails (as following example shows).
>>>
>>> In this patch, I modify the current unrolling policy to do more
>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>> unrolled until reaching the unroll times limitation.
>>>
>>> Here is one example:
>>>     public static void accessArrayConstants(int[] array) {
>>>         for (int j = 0; j < 1024; j++) {
>>>             array[0]++;
>>>             array[1]++;
>>>         }
>>>     }
>>>
>>> Before this patch, the loop will be unrolled by 4 times. 4 is
>>> determined by: AArch64's vector size 128 bits / array element size 32
>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>
>>> Below is the generated code by C2 on AArch64:
>>>
>>> ==== generated code start ====
>>>     0x0000ffff6caf3180: ldr w10, [x1,#16]   ;
>>>     0x0000ffff6caf3184: add w13, w10, #0x1
>>>     0x0000ffff6caf3188: str w13, [x1,#16]   ;
>>>     0x0000ffff6caf318c: ldr w12, [x1,#20]   ;
>>>     0x0000ffff6caf3190: add w13, w10, #0x4
>>>     0x0000ffff6caf3194: add w10, w12, #0x4
>>>     0x0000ffff6caf3198: str w13, [x1,#16]   ;
>>>     0x0000ffff6caf319c: add w11, w11, #0x4  ;
>>>     0x0000ffff6caf31a0: str w10, [x1,#20]   ;
>>>     0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>     0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180  ;
>>> ==== generated code end ====
>>>
>>> After applied this patch, it is unrolled 16 times:
>>>
>>> ==== generated code start ====
>>>     0x0000ffffb0aa6100: ldr w10, [x1,#16]   ;
>>>     0x0000ffffb0aa6104: add w13, w10, #0x1
>>>     0x0000ffffb0aa6108: str w13, [x1,#16]   ;
>>>     0x0000ffffb0aa610c: ldr w12, [x1,#20]   ;
>>>     0x0000ffffb0aa6110: add w13, w10, #0x10
>>>     0x0000ffffb0aa6114: add w10, w12, #0x10
>>>     0x0000ffffb0aa6118: str w13, [x1,#16]   ;
>>>     0x0000ffffb0aa611c: add w11, w11, #0x10  ;
>>>     0x0000ffffb0aa6120: str w10, [x1,#20]   ;
>>>     0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>     0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100  ;
>>> ==== generated code end ====
>>>
>>> This patch passes jtreg tests both on AArch64 and X86.
>>>
>>
> 
> 
>