RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
Vladimir Kozlov
vladimir.kozlov at oracle.com
Fri Sep 29 18:10:10 UTC 2017
On 9/29/17 1:25 AM, Zhongwei Yao wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't produce stable results, especially for
> benchmark cases like startup.xxx, scimark.xxx.large, etc. And I have
> found an obvious performance regression in the rest of the benchmark
> cases. What do you think?
You know that you can change the run parameters for specjvm2008 to avoid waiting long for it to finish.
And you should preferably run it on one node.
Variation in startup is not important in this case. But scimark is important since it shows the quality of the loop optimizations.
Is the regression significant? We need more time to investigate it then.
Thanks,
Vladimir
>
> On 21 September 2017 at 00:18, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Nice.
>>
>> Did you verify that it fixed your case?
>>
>> It would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>>
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses pass_slp and do_unroll_only flags
>>> without adding a new flag. Please take a look:
>>>
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>
>>>
>>>
>>> On 20 September 2017 at 01:54, Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>>
>>>>>
>>>>> Hi, Vladimir,
>>>>>
>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov
>>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>> Why not use existing set_notpassed_slp() instead of
>>>>>> mark_slp_vec_failed()?
>>>>>
>>>>>
>>>>>
>>>>> For two reasons, I have not chosen the existing passed_slp flag:
>>>>
>>>>
>>>>
>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>> should ignore the whole SLP analysis.
>>>>
>>>> In the best-case scenario, SuperWord::unrolling_analysis() should determine
>>>> whether there are vector candidates. For example, check whether an array's
>>>> index depends on the loop's index variable.
>>>>
>>>> Another way is to call SuperWord::unrolling_analysis() only after we have
>>>> done the vector analysis.
>>>>
>>>> These are more complicated changes and out of the scope of this fix. There
>>>> is also a side effect I missed before which may prevent using
>>>> set_notpassed_slp(): LoopMaxUnroll is changed based on the SLP analysis
>>>> before the has_passed_slp() check.
>>>>
>>>> Note, set_notpassed_slp() is also used to additionally unroll
>>>> already-vectorized loops:
>>>>
>>>>
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>
>>>> Maybe you should also call mark_do_unroll_only() when you set
>>>> set_major_progress() for _packset.length() == 0, to avoid the loop_opts_cnt
>>>> problem you pointed out. Can you look into this?
>>>>
>>>> I am not against adding a new is_slp_vec_failed(), but I want to first
>>>> investigate whether we can re-use the existing functions.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>
>>>>> 1. If we call set_notpassed_slp() when _packset.length() == 0 in
>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>> check:
>>>>>
>>>>>    if (cl->has_passed_slp()) {
>>>>>      if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>      // Normal case: loop too big
>>>>>      return false;
>>>>>    }
>>>>>
>>>>> we will ignore the case "cl->has_passed_slp() &&
>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
>>>>> as also exposed in my patch:
>>>>>
>>>>>    if (cl->has_passed_slp()) {
>>>>>      if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>> -    // Normal case: loop too big
>>>>> -    return false;
>>>>> +    // When SLP vectorization failed, we could do more unrolling
>>>>> +    // optimizations if body size is less than limit size. Otherwise,
>>>>> +    // return false due to loop is too big.
>>>>> +    if (!cl->is_slp_vec_failed()) return false;
>>>>>    }
>>>>>
>>>>> However, I have not found a case to support this condition yet.
>>>>>
>>>>> 2. As replied below, in:
>>>>>>
>>>>>>
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>> sw.transform_loop(lpt, true);
>>>>>
>>>>>
>>>>> I need to check whether cl->is_slp_vec_failed() is true. Such
>>>>> checking becomes explicit when using the SLPAutoVecFailed flag.
>>>>>
>>>>>>
>>>>>> Why you need next additional check?:
>>>>>>
>>>>>> - } else if (cl->is_main_loop()) {
>>>>>> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>> sw.transform_loop(lpt, true);
>>>>>>
>>>>>
>>>>> The additional check prevents the case where, when
>>>>> cl->is_slp_vec_failed() is true, SuperWord::output() calls
>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>> loop iteration" will not stop until loop_opts_cnt reaches 0, which is
>>>>> not what we want.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>>
>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>
>>>>>>> In the current implementation, the unroll count is determined
>>>>>>> by the vector size and the element size when SuperWordLoopUnrollAnalysis
>>>>>>> is true (it is true on both X86 and AArch64 for now).
>>>>>>>
>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>> auto-vectorization fails (as the following example shows).
>>>>>>>
>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>> unrolling when SLP auto-vectorization fails, so the loop will be
>>>>>>> unrolled until it reaches the unroll count limit.
>>>>>>>
>>>>>>> Here is one example:
>>>>>>> public static void accessArrayConstants(int[] array) {
>>>>>>>   for (int j = 0; j < 1024; j++) {
>>>>>>>     array[0]++;
>>>>>>>     array[1]++;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> Before this patch, the loop is unrolled 4 times. The 4 is determined
>>>>>>> by: AArch64's vector size of 128 bits / array element size of 32
>>>>>>> bits = 4. On X86, the vector size is 256 bits, so the unroll count is 8.
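
[The unroll-count arithmetic described above can be sketched as follows; the method and class names are illustrative, not the actual C2 identifiers.]

```java
public class UnrollFactor {
    // Unroll count used by SuperWordLoopUnrollAnalysis: how many lanes of
    // the given element size fit in one platform vector register.
    static int slpMaxUnrollFactor(int vectorSizeBits, int elementSizeBits) {
        return vectorSizeBits / elementSizeBits;
    }

    public static void main(String[] args) {
        // AArch64 NEON: 128-bit vectors, 32-bit ints -> unrolled 4 times
        System.out.println(slpMaxUnrollFactor(128, 32));
        // x86 AVX2: 256-bit vectors, 32-bit ints -> unrolled 8 times
        System.out.println(slpMaxUnrollFactor(256, 32));
    }
}
```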
>>>>>>>
>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>> 0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>>>>>>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>>>>>>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>>>>>>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applying this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>>>>>>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>>>>>>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>>>>>>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>>>>>>> ==== generated code end ====
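
[For illustration, the effect visible in the assembly above, where the loop stride and the add immediates change from #0x4 to #0x10, corresponds roughly to the following hand-unrolled Java sketch. This is only a source-level analogy; C2 performs the transformation on its IR, not on the source.]

```java
public class HandUnrolled {
    // Rough source-level equivalent of the 16x-unrolled loop: the 16
    // increments of each element per trip are folded into a single +16,
    // matching the "add ..., #0x10" instructions in the generated code.
    public static void accessArrayConstantsUnrolled16(int[] array) {
        // 1024 % 16 == 0, so no residual (post-)loop is needed here.
        for (int j = 0; j < 1024; j += 16) {
            array[0] += 16;
            array[1] += 16;
        }
    }
}
```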
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>