RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed

Zhongwei Yao zhongwei.yao at linaro.org
Sat Sep 30 06:37:32 UTC 2017


On 30 September 2017 at 02:10, Vladimir Kozlov
<vladimir.kozlov at oracle.com> wrote:
> On 9/29/17 1:25 AM, Zhongwei Yao wrote:
>>
>> Hi, Vladimir,
>>
>> Sorry for my late response!
>>
>> And yes, it solves my case.
>>
>> But I found specjvm2008 doesn't have a stable result, especially for
>> benchmark case like startup.xxx, scimark.xxx.large etc. And I have not
>> found obvious performance regress in the rest of benchmark cases. What
>> do you think?
>
>
> You know that you can change run parameters for specjvm2008 to avoid waiting
> for long to finish.
> And you need to run on one node preferable.
>
> Variations in startup is not important in this case. But scimark is
> important since they show quality of loop optimizations.
>
> Does regression significant? We need more time to investigate it then.

I see performance data fluctuates in specjvm2008.

However, I check the scimark 2.0 (http://math.nist.gov/scimark2/) and
see no performance regression in it both on x86 and arm64.

>
> Thanks,
> Vladimir
>
>
>>
>> On 21 September 2017 at 00:18, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>>
>>> Nice.
>>>
>>> Did you verified that it fixed your case?
>>>
>>> Would be nice to run specjvm2008 to make sure performance did not
>>> regress.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>>
>>> On 9/20/17 4:07 AM, Zhongwei Yao wrote:
>>>>
>>>>
>>>> Thanks for your suggestions!
>>>>
>>>> I've updated the patch that uses pass_slp and do_unroll_only flags
>>>> without adding a new flag. Please take a look:
>>>>
>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.01/
>>>>
>>>>
>>>>
>>>> On 20 September 2017 at 01:54, Vladimir Kozlov
>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 9/18/17 10:59 PM, Zhongwei Yao wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi, Vladimir,
>>>>>>
>>>>>> On 19 September 2017 at 00:17, Vladimir Kozlov
>>>>>> <vladimir.kozlov at oracle.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Why not use existing set_notpassed_slp() instead of
>>>>>>> mark_slp_vec_failed()?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Due to 2 reasons, I have not chosen existing passed_slp flag:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> My point is that if we don't find vectors in a loop (as in your case)
>>>>> we
>>>>> should ignore whole SLP analysis.
>>>>>
>>>>> In best case scenario SuperWord::unrolling_analysis() should determine
>>>>> if
>>>>> there are vectors candidates. For example, check if array's index is
>>>>> depend
>>>>> on loop's index variable.
>>>>>
>>>>> An other way is to call SuperWord::unrolling_analysis() only after we
>>>>> did
>>>>> vector analysis.
>>>>>
>>>>> It is more complicated changes and out of scope of this. There is also
>>>>> side
>>>>> effect I missed before which may prevent using set_notpassed_slp():
>>>>> LoopMaxUnroll is changed based on SLP analysis before has_passed_slp()
>>>>> check.
>>>>>
>>>>> Note, set_notpassed_slp() is also used to additional unroll already
>>>>> vectorized loops:
>>>>>
>>>>>
>>>>>
>>>>> http://hg.openjdk.java.net/jdk10/hs/hotspot/file/5ab7a67bc155/src/share/vm/opto/superword.cpp#l2421
>>>>>
>>>>> May be you should also call mark_do_unroll_only() when you set
>>>>> set_major_progress() for _packset.length() == 0 to avoid loop_opts_cnt
>>>>> problem you pointed. Can you look on this?
>>>>>
>>>>> I am not against adding new is_slp_vec_failed() but I want first to
>>>>> investigate if we can re-use existing functions.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>
>>>>>>      1. If we set_notpassed_slp() when _packset.length() == 0 in
>>>>>> SuperWord::output(), then in the IdealLoopTree::policy_unroll()
>>>>>> checking:
>>>>>>
>>>>>>       if (cl->has_passed_slp()) {
>>>>>>         if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>>         // Normal case: loop too big
>>>>>>         return false;
>>>>>>       }
>>>>>>
>>>>>>       we will ignore the case: "cl->has_passed_slp() &&
>>>>>> slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()"
>>>>>> as alos exposed in my patch:
>>>>>>
>>>>>>       if (cl->has_passed_slp()) {
>>>>>>         if (slp_max_unroll_factor >= future_unroll_ct) return true;
>>>>>> -    // Normal case: loop too big
>>>>>> -    return false;
>>>>>> +    // When SLP vectorization failed, we could do more unrolling
>>>>>> +    // optimizations if body size is less than limit size. Otherwise,
>>>>>> +    // return false due to loop is too big.
>>>>>> +    if (!cl->is_slp_vec_failed()) return false;
>>>>>>       }
>>>>>>
>>>>>>       However, I have not found a case to support this condition yet.
>>>>>>
>>>>>>      2. As replied below, in:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -        } else if (cl->is_main_loop()) {
>>>>>>> +        } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>               sw.transform_loop(lpt, true);
>>>>>>
>>>>>>
>>>>>>
>>>>>>          I need to check whether cl->is_slp_vec_failed() is true.Such
>>>>>> checking becomes explicit when using SLPAutoVecFailed flag.
>>>>>>
>>>>>>>
>>>>>>> Why you need next additional check?:
>>>>>>>
>>>>>>> -        } else if (cl->is_main_loop()) {
>>>>>>> +        } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
>>>>>>>               sw.transform_loop(lpt, true);
>>>>>>>
>>>>>>
>>>>>> The additional check prevents the case that when
>>>>>> cl->is_slp_vec_failed() is true, then SuperWord::output() will
>>>>>> set_major_progress() at the beginning (because _packset.length() == 0
>>>>>> is true when cl->is_slp_vec_failed() is true). Then the "phase ideal
>>>>>> loop iteration" will not stop untill loop_opts_cnt reachs 0, which is
>>>>>> not we want.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir
>>>>>>>
>>>>>>>
>>>>>>> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>>>>>>>
>>>>>>>> Hi, all,
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>>>>>>>
>>>>>>>> In the current implementation, the loop unrolling times are
>>>>>>>> determined
>>>>>>>> by vector size and element size when SuperWordLoopUnrollAnalysis is
>>>>>>>> true (both X86 and aarch64 are true for now).
>>>>>>>>
>>>>>>>> This unrolling policy generates less optimized code when SLP
>>>>>>>> auto-vectorization fails (as following example shows).
>>>>>>>>
>>>>>>>> In this patch, I modify the current unrolling policy to do more
>>>>>>>> unrolling when SLP auto-vectorization fails. So the loop will be
>>>>>>>> unrolled until reaching the unroll times limitation.
>>>>>>>>
>>>>>>>> Here is one example:
>>>>>>>>       public static void accessArrayConstants(int[] array) {
>>>>>>>>           for (int j = 0; j < 1024; j++) {
>>>>>>>>               array[0]++;
>>>>>>>>               array[1]++;
>>>>>>>>           }
>>>>>>>>       }
>>>>>>>>
>>>>>>>> Before this patch, the loop will be unrolled by 4 times. 4 is
>>>>>>>> determined by: AArch64's vector size 128 bits / array element size
>>>>>>>> 32
>>>>>>>> bits = 4. On X86, vector size is 256 bits. So the unroll times are
>>>>>>>> 8.
>>>>>>>>
>>>>>>>> Below is the generated code by C2 on AArch64:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>>       0x0000ffff6caf3180: ldr w10, [x1,#16]   ;
>>>>>>>>       0x0000ffff6caf3184: add w13, w10, #0x1
>>>>>>>>       0x0000ffff6caf3188: str w13, [x1,#16]   ;
>>>>>>>>       0x0000ffff6caf318c: ldr w12, [x1,#20]   ;
>>>>>>>>       0x0000ffff6caf3190: add w13, w10, #0x4
>>>>>>>>       0x0000ffff6caf3194: add w10, w12, #0x4
>>>>>>>>       0x0000ffff6caf3198: str w13, [x1,#16]   ;
>>>>>>>>       0x0000ffff6caf319c: add w11, w11, #0x4  ;
>>>>>>>>       0x0000ffff6caf31a0: str w10, [x1,#20]   ;
>>>>>>>>       0x0000ffff6caf31a4: cmp w11, #0x3fd
>>>>>>>>       0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180  ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> After applied this patch, it is unrolled 16 times:
>>>>>>>>
>>>>>>>> ==== generated code start ====
>>>>>>>>       0x0000ffffb0aa6100: ldr w10, [x1,#16]   ;
>>>>>>>>       0x0000ffffb0aa6104: add w13, w10, #0x1
>>>>>>>>       0x0000ffffb0aa6108: str w13, [x1,#16]   ;
>>>>>>>>       0x0000ffffb0aa610c: ldr w12, [x1,#20]   ;
>>>>>>>>       0x0000ffffb0aa6110: add w13, w10, #0x10
>>>>>>>>       0x0000ffffb0aa6114: add w10, w12, #0x10
>>>>>>>>       0x0000ffffb0aa6118: str w13, [x1,#16]   ;
>>>>>>>>       0x0000ffffb0aa611c: add w11, w11, #0x10  ;
>>>>>>>>       0x0000ffffb0aa6120: str w10, [x1,#20]   ;
>>>>>>>>       0x0000ffffb0aa6124: cmp w11, #0x3f1
>>>>>>>>       0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100  ;
>>>>>>>> ==== generated code end ====
>>>>>>>>
>>>>>>>> This patch passes jtreg tests both on AArch64 and X86.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>



-- 
Best regards,
Zhongwei


More information about the hotspot-compiler-dev mailing list