[aarch64-port-dev ] RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed

Fri Sep 29 09:22:24 UTC 2017

I made a typo in the previous reply.

On 29 September 2017 at 16:25, Zhongwei Yao <zhongwei.yao at linaro.org> wrote:
> Hi, Vladimir,
>
> Sorry for my late response!
>
> And yes, it solves my case.
>
> But I found specjvm2008 doesn't have a stable result, especially for
> benchmark case like startup.xxx, scimark.xxx.large etc. And I have
> found obvious performance regress in the rest of benchmark cases. What

And I have NOT found obvious performance regress in the rest of benchmark cases.

> do you think?
>
> On 21 September 2017 at 00:18, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>> Nice.
>>
>> Did you verified that it fixed your case?
>>
>> Would be nice to run specjvm2008 to make sure performance did not regress.
>>
>> Thanks,
>> Vladimir
>>
>>
>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>>
>>> Thanks for your suggestions!
>>>
>>> I've updated the patch that uses pass_slp and do_unroll_only flags
>>> without adding a new flag. Please take a look:
>>>
>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>
>>>
>>>
>>> On 20 September 2017 at 01:54, Vladimir Kozlov
>>> <vladimir.kozlov at oracle.com> wrote:
>>>>
>>>>
>>>>
>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>>
>>>>>
>>>>> Hi, Vladimir,
>>>>>
>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov
>>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>>>
>>>>>>
>>>>>> Why not use existing set_notpassed_slp() instead of
>>>>>> mark_slp_vec_failed()?
>>>>>
>>>>>
>>>>>
>>>>> Due to 2 reasons, I have not chosen existing passed_slp flag:
>>>>
>>>>
>>>>
>>>> My point is that if we don't find vectors in a loop (as in your case) we
>>>> should ignore whole SLP analysis.
>>>>
>>>> In best case scenario SuperWord::unrolling_analysis() should determine if
>>>> there are vectors candidates. For example, check if array's index is
>>>> depend
>>>> on loop's index variable.
>>>>
>>>> An other way is to call SuperWord::unrolling_analysis() only after we did
>>>> vector analysis.
>>>>
>>>> It is more complicated changes and out of scope of this. There is also
>>>> side
>>>> effect I missed before which may prevent using set_notpassed_slp():
>>>> LoopMaxUnroll is changed based on SLP analysis before has_passed_slp()
>>>> check.
>>>>
>>>> Note, set_notpassed_slp() is also used to additional unroll already
>>>> vectorized loops:
>>>>
>>>>
>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>
>>>> May be you should also call mark_do_unroll_only() when you set
>>>> set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt
>>>> problem you pointed. Can you look on this?
>>>>
>>>> I am not against adding new is_slp_vec_failed() but I want first to
>>>> investigate if we can re-use existing functions.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>
>>>>>     1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>> checking:
>>>>>
>>>>>      if (cl->has_passed_slp()) {
>>>>>        if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>        // Normal case: loop too big
>>>>>        return false;
>>>>>      }
>>>>>
>>>>>      we will ignore the case: "cl->has_passed_slp() &&
>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()"
>>>>> as alos exposed in my patch:
>>>>>
>>>>>      if (cl->has_passed_slp()) {
>>>>>        if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>> -    // Normal case: loop too big
>>>>> -    return false;
>>>>> +    // When SLP vectorization failed, we could do more unrolling
>>>>> +    // optimizations if body size is less than limit size. Otherwise,
>>>>> +    // return false due to loop is too big.
>>>>> +    if (!cl->is_slp_vec_failed()) return false;
>>>>>      }
>>>>>
>>>>>      However, I have not found a case to support this condition yet.
>>>>>
>>>>>     2. As replied below, in:
>>>>>>
>>>>>>
>>>>>> -        } else if (cl->is_main_loop()) {
>>>>>> +        } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>              sw.transform_loop(lpt, true);
>>>>>
>>>>>
>>>>>         I need to check whether cl->is_slp_vec_failed() is true.Such
>>>>> checking becomes explicit when using SLPAutoVecFailed flag.
>>>>>
>>>>>>
>>>>>> Why you need next additional check?:
>>>>>>
>>>>>> -        } else if (cl->is_main_loop()) {
>>>>>> +        } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>              sw.transform_loop(lpt, true);
>>>>>>
>>>>>
>>>>> The additional check prevents the case that when
>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>> loop iteration" will not stop untill loop_opts_cnt reachs 0, which is
>>>>> not we want.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>>
>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>
>>>>>>> Hi, all,
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>
>>>>>>> In the current implementation, the loop unrolling times are determined
>>>>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>
>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>> auto-vectorization fails (as following example shows).
>>>>>>>
>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>> unrolled until reaching the unroll times limitation.
>>>>>>>
>>>>>>> Here is one example:
>>>>>>>      public static void accessArrayConstants(int[] array) {
>>>>>>>          for (int j = 0; j < 1024; j++) {
>>>>>>>              array[0]++;
>>>>>>>              array[1]++;
>>>>>>>          }
>>>>>>>      }
>>>>>>>
>>>>>>> Before this patch, the loop will be unrolled by 4 times. 4 is
>>>>>>> determined by: AArch64's vector size 128 bits / array element size 32
>>>>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are 8.
>>>>>>>
>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>>      0x0000ffff6caf3180: ldr w10, [x1,#16]   ;
>>>>>>>      0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>>      0x0000ffff6caf3188: str w13, [x1,#16]   ;
>>>>>>>      0x0000ffff6caf318c: ldr w12, [x1,#20]   ;
>>>>>>>      0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>>      0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>>      0x0000ffff6caf3198: str w13, [x1,#16]   ;
>>>>>>>      0x0000ffff6caf319c: add w11, w11, #0x4  ;
>>>>>>>      0x0000ffff6caf31a0: str w10, [x1,#20]   ;
>>>>>>>      0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>>      0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180  ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> After applied this patch, it is unrolled 16 times:
>>>>>>>
>>>>>>> ==== generated code start ====
>>>>>>>      0x0000ffffb0aa6100: ldr w10, [x1,#16]   ;
>>>>>>>      0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>>      0x0000ffffb0aa6108: str w13, [x1,#16]   ;
>>>>>>>      0x0000ffffb0aa610c: ldr w12, [x1,#20]   ;
>>>>>>>      0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>>      0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>>      0x0000ffffb0aa6118: str w13, [x1,#16]   ;
>>>>>>>      0x0000ffffb0aa611c: add w11, w11, #0x10  ;
>>>>>>>      0x0000ffffb0aa6120: str w10, [x1,#20]   ;
>>>>>>>      0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>>      0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100  ;
>>>>>>> ==== generated code end ====
>>>>>>>
>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>
> --
> Best regards,
> Zhongwei

-- 
Best regards,
Zhongwei