RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling
Jie Fu
fujie at loongson.cn
Mon Aug 12 09:26:37 UTC 2019
Hi Vladimir and all,
Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.02/
*Analysis*
The performance drop is caused by over-unrolling in
SuperWordLoopUnrollAnalysis, which does not consider the negative effect
of the pre-/post-loops at all.
The following is the perf stat data for different loop unrolling factors.
Please note that the number of branches increased by ~47% (from
19,854,995,391 to 29,229,714,991) when the unroll-factor increased from
8 to 16.
The total instructions increased by ~57% (from 108,849,151,185 to
171,334,733,346), which was even worse.
perf stat for unroll-factor=8:
----------------------------------------------------------------------------
       5429.117030      task-clock (msec)      #    1.006 CPUs utilized
               620      context-switches       #    0.114 K/sec
                11      cpu-migrations         #    0.002 K/sec
            41,905      page-faults            #    0.008 M/sec
    24,176,919,686      cycles                 #    4.453 GHz
   108,849,151,185      instructions           #    4.50  insn per cycle
    19,854,995,391      branches               # 3657.132 M/sec
        17,788,819      branch-misses          #    0.09% of all branches

       5.396099347 seconds time elapsed
----------------------------------------------------------------------------
perf stat for unroll-factor=16:
----------------------------------------------------------------------------
       9158.323771      task-clock (msec)      #    1.005 CPUs utilized
               763      context-switches       #    0.083 K/sec
                16      cpu-migrations         #    0.002 K/sec
            41,884      page-faults            #    0.005 M/sec
    40,831,102,837      cycles                 #    4.458 GHz
   171,334,733,346      instructions           #    4.20  insn per cycle
    29,229,714,991      branches               # 3191.601 M/sec
        16,455,010      branch-misses          #    0.06% of all branches

       9.115309970 seconds time elapsed
----------------------------------------------------------------------------
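The relative increases can be re-derived from the raw counters above. The following is only a small sketch for checking the arithmetic; the class and method names are made up for illustration and are not part of the webrev.

```java
// Hypothetical helper (not part of the patch): recomputes the relative
// increases from the perf stat counters quoted above.
final class PerfStatDelta {
    // Percentage increase when going from 'before' to 'after'.
    static double percentIncrease(long before, long after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Counters copied from the two perf stat runs (unroll-factor 8 vs 16).
        long branches8 = 19_854_995_391L, branches16 = 29_229_714_991L;
        long insns8 = 108_849_151_185L, insns16 = 171_334_733_346L;
        long cycles8 = 24_176_919_686L, cycles16 = 40_831_102_837L;

        System.out.printf("branches:     +%.1f%%%n", percentIncrease(branches8, branches16));
        System.out.printf("instructions: +%.1f%%%n", percentIncrease(insns8, insns16));
        System.out.printf("cycles:       +%.1f%%%n", percentIncrease(cycles8, cycles16));
    }
}
```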
The increase in branches and total instructions was mainly introduced
by the pre- and post-loops.
1) A higher unroll-factor may lead to more iterations in the pre-loop due
to the alignment requirement of the main-loop.
For example, with unroll-factor=16, up to 16 iterations may be required in
the pre-loop since 16-byte vector instructions are used in the main-loop.
With unroll-factor=8, however, no more than 8 iterations are required.
2) A higher unroll-factor may lead to more iterations in the post-loop,
since its iteration count lies in the range [0, unroll-factor - 1].
For this particular case, the distribution of iterations across the
pre-/main-/post-loops seems to be:
-----------------------------------------------------------------------
           | pre-lp iters | main-lp iters | post-lp iters | total iters
-----------------------------------------------------------------------
unroll(8)  |      8       |       6       |       5       |     19
-----------------------------------------------------------------------
unroll(16) |     16       |       2       |      13       |     31
-----------------------------------------------------------------------
So it's harmful to unroll with 16.
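The distribution in the table can be reproduced with a simple model. This is only a sketch: the element count of 61 is inferred from the table (8 + 6*8 + 5 = 16 + 2*16 + 13 = 61) and is an assumption, the pre-loop count is passed in rather than derived from alignment, and all names are hypothetical.

```java
// Hypothetical model (not C2 code): splits a loop over 'elements' items
// into pre-/main-/post-loop iterations for a given unroll factor.
final class LoopSplitModel {
    /** Returns {pre, main, post, total} loop-iteration counts. */
    static int[] split(int elements, int unrollFactor, int preIters) {
        int pre = Math.min(preIters, elements);
        // Each main-loop iteration processes 'unrollFactor' elements.
        int main = (elements - pre) / unrollFactor;
        // Whatever is left is handled one element at a time in the post-loop.
        int post = elements - pre - main * unrollFactor;
        return new int[] {pre, main, post, pre + main + post};
    }

    public static void main(String[] args) {
        int elements = 61; // assumption inferred from the table
        for (int unroll : new int[] {8, 16}) {
            // Assume the pre-loop needs 'unroll' iterations, as in the table.
            int[] r = split(elements, unroll, unroll);
            System.out.printf("unroll(%d): pre=%d main=%d post=%d total=%d%n",
                              unroll, r[0], r[1], r[2], r[3]);
        }
    }
}
```

Under these assumptions the model yields 19 total iterations for unroll(8) and 31 for unroll(16), matching the table.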
*Fix*
The loop body size alone seems unable to detect this case.
When the VM decides whether to unroll by 16, the loop body size is
just 64, which seems quite reasonable for SuperWordLoopUnrollAnalysis.
And the generated loop body in assembly is small enough when unrolled by 16.
----------------------------------------------------------------------------
;; B18: # out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner main of N59) Freq: 891835
0x00007f62187bfba0: movslq %r8d,%r11
0x00007f62187bfba3: vmovdqu 0x10(%r9,%r11,1),%xmm0
0x00007f62187bfbaa: vmovdqu %xmm0,0x10(%rsi,%r11,1)  ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                                                     ; - TestSuperWordOverunrolling::execute@56 (line 21)
0x00007f62187bfbb1: add $0x10,%r8d                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                     ; - TestSuperWordOverunrolling::execute@57 (line 20)
0x00007f62187bfbb5: cmp $0x2f,%r8d
0x00007f62187bfbb9: jl 0x00007f62187bfba0            ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                     ; - TestSuperWordOverunrolling::execute@38 (line 20)
----------------------------------------------------------------------------
To fix it, the possible negative effect of the pre-/post-loops should be
considered.
Unrolling may improve performance if the total number of iterations of
the pre/main/post loops decreases.
However, the precise number of iterations in the pre-/post-loops is
really hard to predict since it depends on many factors such as alignment
requirements, the number and types of data, object layout, and allocated
addresses.
To simplify the problem, the number of pre- plus post-loop iterations is
simply assumed to be equal to the unroll-factor.
A heuristic is introduced to protect against over-unrolling with
SuperWordLoopUnrollAnalysis:
- Let's assume the unroll-factor is x and the main-loop iteration count
is y in the previous unrolling round; then the total iterations of the
pre/main/post loops is y + x.
- In the next round, the unroll-factor becomes 2x and the main-loop
iteration count becomes y/2, so the total iterations of the pre/main/post
loops is y/2 + 2x.
- We'd better not unroll if y/2 + 2x > y + x, that is, x > y/2, or
equivalently 2x > y.
----------------------------------------------------------------------------
           | unroll-factor | main-lp iters | pre&post-lp iters | total iters
----------------------------------------------------------------------------
pre-round  |       x       |       y       |         x         |  y + x
----------------------------------------------------------------------------
next-round |      2x       |      y/2      |        2x         |  y/2 + 2x
----------------------------------------------------------------------------
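The heuristic above can be sketched as a small predicate. This is an illustration only, written in Java for readability; the actual change lives in the C2 sources in the webrev, and the class and method names below are hypothetical.

```java
// Hypothetical sketch of the heuristic (not the actual HotSpot patch).
final class UnrollHeuristic {
    // Estimated total iterations, assuming pre- plus post-loop
    // iterations equal the unroll factor.
    static long estimatedTotal(long unrollFactor, long mainIters) {
        return mainIters + unrollFactor;
    }

    // x: current unroll factor, y: current main-loop iteration count.
    // Doubling is rejected when y/2 + 2x > y + x, i.e. when 2x > y.
    static boolean worthDoubling(long x, long y) {
        return !(2 * x > y);
    }

    public static void main(String[] args) {
        // The case from this mail: unroll-factor 8 with 6 main-loop iterations.
        System.out.println("double 8 -> 16 with y=6?  " + worthDoubling(8, 6));
        // A loop with many main-loop iterations still benefits.
        System.out.println("double 8 -> 16 with y=64? " + worthDoubling(8, 64));
    }
}
```

For the case analyzed earlier (x = 8, y = 6), 2x = 16 > 6, so the predicate rejects the step from 8 to 16.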
*Testing*
No performance regression was observed in SPECjvm2008.
Any comments?
Thanks a lot.
Best regards,
Jie
On 2019/8/7 5:58 AM, Vladimir Kozlov wrote:
> Hi Jie
>
> Very interesting observation. I am concerned that webrev.01 does the
> check for general loops, which may not be vectorized. Even if your
> optimization helps in this particular case, it may make some loops
> regress due to executing more branches.
>
> On 7/11/19 1:20 AM, Jie Fu wrote:
>> Hi all,
>>
>> With more experiments, the loop's trip_count seems to be a good feature
>> for detecting over-unrolling.
>> And on some platforms, the branch-miss rate has been observed to
>> increase dramatically with a small loop trip count.
>
> Why? With more unrolling you should have fewer branches.
>
>> It seems that we shouldn't unroll if the trip count becomes too small.
>
> Maybe there is a different explanation for this. Maybe a big loop body
> does not fit into the code buffer in an x86 CPU, or something like that.
> And we should watch the body size instead.
>
> Thanks,
> Vladimir
>
>>
>> I've updated the webrev here:
>> http://cr.openjdk.java.net/~jiefu/8227505/webrev.01/
>>
>> Please review it and give me some advice.
>>
>> Thanks a lot.
>> Best regards,
>> Jie
>>
>> On 2019/7/10 4:38 PM, Jie Fu wrote:
>>> Hi all,
>>>
>>> JBS: https://bugs.openjdk.java.net/browse/JDK-8227505
>>> Webrev: http://cr.openjdk.java.net/~jiefu/8227505/webrev.00/
>>>
>>> The patch fixes the over-unrolling problem caused by
>>> SuperWordLoopUnrollAnalysis.
>>> For more info, please refer to the JBS.
>>>
>>> Could you please review it and give me some advice?
>>>
>>> Thanks a lot.
>>> Best regards,
>>> Jie
>>>
>>>
>>