RFR: JDK-8187601: Unrolling more when SLP auto-vectorization failed
Zhongwei Yao
zhongwei.yao at linaro.org
Tue Sep 19 05:59:18 UTC 2017
Hi, Vladimir,
On 19 September 2017 at 00:17, Vladimir Kozlov
<vladimir.kozlov at oracle.com> wrote:
> Why not use existing set_notpassed_slp() instead of mark_slp_vec_failed()?
I did not use the existing passed_slp flag for two reasons:
1. If we call set_notpassed_slp() when _packset.length() == 0 in
SuperWord::output(), then the following check in
IdealLoopTree::policy_unroll():
if (cl->has_passed_slp()) {
if (slp_max_unroll_factor >= future_unroll_ct) return true;
// Normal case: loop too big
return false;
}
we would ignore the case "cl->has_passed_slp() &&
slp_max_unroll_factor < future_unroll_ct && !cl->is_slp_vec_failed()",
which my patch also exposes:
if (cl->has_passed_slp()) {
if (slp_max_unroll_factor >= future_unroll_ct) return true;
- // Normal case: loop too big
- return false;
+ // When SLP vectorization failed, we could do more unrolling
+ // optimizations if body size is less than limit size. Otherwise,
+ // return false due to loop is too big.
+ if (!cl->is_slp_vec_failed()) return false;
}
However, I have not yet found a case that exercises this condition.
2. As explained in my reply below, in:
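To make reason 1 concrete, here is a rough model of the SLP branch of policy_unroll() before and after the patch. It is a simplified sketch, not the HotSpot code: UnrollDecision, LoopState and the two functions are hypothetical names, and "CheckBodySize" stands for falling through to the ordinary body-size checks. With a separate is_slp_vec_failed flag, "SLP passed, vectorization succeeded, loop too big" still stops unrolling, while "vectorization failed" falls through.

```cpp
#include <cassert>

// Hypothetical three-way result of the SLP branch in policy_unroll().
enum class UnrollDecision { Unroll, DontUnroll, CheckBodySize };

// Hypothetical stand-in for the CountedLoopNode flags discussed above.
struct LoopState {
  bool has_passed_slp;
  bool is_slp_vec_failed;
};

// Before the patch: once SLP analysis has run, unrolling stops at
// slp_max_unroll_factor regardless of whether vectorization succeeded.
UnrollDecision slp_branch_before(const LoopState& cl, int slp_max_unroll_factor,
                                 int future_unroll_ct) {
  if (cl.has_passed_slp) {
    if (slp_max_unroll_factor >= future_unroll_ct) return UnrollDecision::Unroll;
    return UnrollDecision::DontUnroll;  // "Normal case: loop too big"
  }
  return UnrollDecision::CheckBodySize;
}

// After the patch: if vectorization failed, fall through to the
// ordinary body-size checks instead of stopping.
UnrollDecision slp_branch_after(const LoopState& cl, int slp_max_unroll_factor,
                                int future_unroll_ct) {
  if (cl.has_passed_slp) {
    if (slp_max_unroll_factor >= future_unroll_ct) return UnrollDecision::Unroll;
    if (!cl.is_slp_vec_failed) return UnrollDecision::DontUnroll;
  }
  return UnrollDecision::CheckBodySize;
}
```

For example, with slp_max_unroll_factor = 4 and future_unroll_ct = 8, a loop whose vectorization failed stops under the old logic but continues to the size checks under the new one.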
> - } else if (cl->is_main_loop()) {
> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
> sw.transform_loop(lpt, true);
I need to check whether cl->is_slp_vec_failed() is true. Such a check
is more explicit with a dedicated SLPAutoVecFailed flag.
>
> Why you need next additional check?:
>
> - } else if (cl->is_main_loop()) {
> + } else if (cl->is_main_loop() && !cl->is_slp_vec_failed()) {
> sw.transform_loop(lpt, true);
>
The additional check prevents the following: when
cl->is_slp_vec_failed() is true, SuperWord::output() calls
set_major_progress() at the beginning (because _packset.length() == 0
whenever cl->is_slp_vec_failed() is true). The "phase ideal loop"
iterations would then not stop until loop_opts_cnt reaches 0, which is
not what we want.
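A toy model of this non-termination, under simplifying assumptions: the driver repeats loop opts while the previous pass reported major progress and the loop_opts_cnt budget is not exhausted, SLP has already failed on the loop, and reaching SuperWord::output() with an empty pack set always reports major progress. run_loop_opts() and its guard parameter are hypothetical names, not C2's real driver.

```cpp
#include <cassert>

// Returns how many loop-opts passes run before the driver stops.
// guard_vec_failed models the extra "&& !cl->is_slp_vec_failed()" check.
int run_loop_opts(int loop_opts_cnt, bool guard_vec_failed) {
  const bool slp_vec_failed = true;  // assumption: SLP already failed on this loop
  int passes = 0;
  bool major_progress = true;
  while (major_progress && loop_opts_cnt-- > 0) {
    ++passes;
    major_progress = false;
    // Without the guard, transform_loop() is called again; its
    // SuperWord::output() sees an empty pack set and reports progress.
    bool calls_transform = !(guard_vec_failed && slp_vec_failed);
    if (calls_transform) major_progress = true;
  }
  return passes;
}
```

With the guard the model stops after one pass; without it, every pass reports progress, so it runs for the whole budget.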
>
> Thanks,
> Vladimir
>
>
> On 9/18/17 2:58 AM, Zhongwei Yao wrote:
>>
>> [Forward from aarch64-port-dev to hotspot-compiler-dev]
>>
>> Hi, all,
>>
>> Bug:
>> https://bugs.openjdk.java.net/browse/JDK-8187601
>>
>> Webrev:
>> http://cr.openjdk.java.net/~zyao/8187601/webrev.00
>>
>> In the current implementation, the unroll factor is determined
>> by the vector size and the element size when SuperWordLoopUnrollAnalysis
>> is true (it is currently true on both X86 and AArch64).
>>
>> This unrolling policy generates less optimized code when SLP
>> auto-vectorization fails (as the following example shows).
>>
>> In this patch, I modify the current unrolling policy to unroll more
>> when SLP auto-vectorization fails, so the loop is unrolled until it
>> reaches the unroll limit.
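A rough model of the policy change, assuming each unrolling round doubles the unroll count (future_unroll_ct = current count * 2) and that max_unroll, a hypothetical parameter, stands in for the real unroll limit and body-size checks. The cap of 16 in the usage below is an assumption chosen to match the 16x unrolling reported for the patched build.

```cpp
#include <cassert>

// Returns the final unroll count reached by repeated doubling.
// patched = true models the new rule: if SLP vectorization failed,
// keep unrolling past slp_max_unroll_factor up to max_unroll.
int final_unroll_count(int slp_max_unroll_factor, int max_unroll,
                       bool slp_vec_failed, bool patched) {
  int unrolled = 1;
  while (true) {
    int future_unroll_ct = unrolled * 2;
    if (future_unroll_ct > max_unroll) break;  // global limit reached
    bool within_slp = slp_max_unroll_factor >= future_unroll_ct;
    if (!within_slp && !(patched && slp_vec_failed)) break;  // old SLP cap
    unrolled = future_unroll_ct;
  }
  return unrolled;
}
```

With slp_max_unroll_factor = 4 and a failed vectorization, the old rule stops at 4 while the new rule reaches 16; a successfully vectorized loop still stops at 4.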
>>
>> Here is one example:
>> public static void accessArrayConstants(int[] array) {
>> for (int j = 0; j < 1024; j++) {
>> array[0]++;
>> array[1]++;
>> }
>> }
>>
>> Before this patch, the loop is unrolled 4 times. 4 is determined
>> by: AArch64's vector size of 128 bits / array element size of 32
>> bits = 4. On X86, the vector size is 256 bits, so the unroll factor is 8.
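The arithmetic above, written out as a tiny helper. slp_unroll_factor() is a hypothetical name; it only encodes the "vector bits / element bits" rule stated here for int arrays.

```cpp
#include <cassert>

// Vector width divided by element width, both in bits.
constexpr int slp_unroll_factor(int vector_bits, int element_bits) {
  return vector_bits / element_bits;
}

static_assert(slp_unroll_factor(128, 32) == 4, "AArch64, int elements");
static_assert(slp_unroll_factor(256, 32) == 8, "X86 with 256-bit vectors, int elements");
```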
>>
>> Below is the code generated by C2 on AArch64:
>>
>> ==== generated code start ====
>> 0x0000ffff6caf3180: ldr w10, [x1,#16] ;
>> 0x0000ffff6caf3184: add w13, w10, #0x1
>> 0x0000ffff6caf3188: str w13, [x1,#16] ;
>> 0x0000ffff6caf318c: ldr w12, [x1,#20] ;
>> 0x0000ffff6caf3190: add w13, w10, #0x4
>> 0x0000ffff6caf3194: add w10, w12, #0x4
>> 0x0000ffff6caf3198: str w13, [x1,#16] ;
>> 0x0000ffff6caf319c: add w11, w11, #0x4 ;
>> 0x0000ffff6caf31a0: str w10, [x1,#20] ;
>> 0x0000ffff6caf31a4: cmp w11, #0x3fd
>> 0x0000ffff6caf31a8: b.lt 0x0000ffff6caf3180 ;
>> ==== generated code end ====
>>
>> After applying this patch, the loop is unrolled 16 times:
>>
>> ==== generated code start ====
>> 0x0000ffffb0aa6100: ldr w10, [x1,#16] ;
>> 0x0000ffffb0aa6104: add w13, w10, #0x1
>> 0x0000ffffb0aa6108: str w13, [x1,#16] ;
>> 0x0000ffffb0aa610c: ldr w12, [x1,#20] ;
>> 0x0000ffffb0aa6110: add w13, w10, #0x10
>> 0x0000ffffb0aa6114: add w10, w12, #0x10
>> 0x0000ffffb0aa6118: str w13, [x1,#16] ;
>> 0x0000ffffb0aa611c: add w11, w11, #0x10 ;
>> 0x0000ffffb0aa6120: str w10, [x1,#20] ;
>> 0x0000ffffb0aa6124: cmp w11, #0x3f1
>> 0x0000ffffb0aa6128: b.lt 0x0000ffffb0aa6100 ;
>> ==== generated code end ====
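A C++ sketch of the shape of both listings for unroll factor U, for reading the constants. Assumptions: the pre-loop that C2 also emits is ignored, and the main-loop limit is taken as trip_limit - (U - 1), which matches the compare immediates (0x3fd = 1021 for U = 4, 0x3f1 = 1009 for U = 16). The induction step and the folded increments correspond to the #0x4 / #0x10 immediates. The function names are hypothetical.

```cpp
#include <cassert>

// Main-loop limit so that a full group of U iterations always fits.
constexpr int main_loop_limit(int trip_limit, int U) { return trip_limit - (U - 1); }

// accessArrayConstants() after unrolling by U (pre-loop omitted).
void unrolled_access(int* array, int U) {
  int j = 0;
  for (; j < main_loop_limit(1024, U); j += U) {
    array[0] += U;  // U copies of array[0]++ folded into one add
    array[1] += U;
  }
  for (; j < 1024; j++) {  // post-loop for any leftover iterations
    array[0]++;
    array[1]++;
  }
}
```

For either factor, the result is the same as the original loop: both elements end up incremented 1024 times. Only the number of back-branches differs.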
>>
>> This patch passes jtreg tests both on AArch64 and X86.
>>
>
--
Best regards,
Zhongwei