RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Tue Aug 20 03:44:30 UTC 2019

Hi All

I tested this patch with a small test that adds two byte arrays:
for (int i = 0; i < NUM; i++) {
    data[i] = (byte)(data2[i] + data3[i]);
}

With the patch, the loop is unrolled only half as much as before, so the maximum vector length cannot be used: the generated code uses 256-bit vector instructions instead of the maximum available 512 bits. Also, the loop did not get unrolled after vectorization. The generated code is given below.

Please find the small test attached with the mail.
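
For readers without the attachment, a minimal stand-alone sketch of such a test might look like the following. It is not the attached file: the class name, the value of NUM, the fill pattern, and the repetition count are all assumptions chosen only so that C2 has a hot loop to compile.

```java
// Hypothetical stand-alone test; not the attachment from this mail.
final class ByteArrayAddSketch {
    static final int NUM = 1024; // assumed array length

    // The loop from the message: element-wise byte addition.
    static void add(byte[] data, byte[] data2, byte[] data3) {
        for (int i = 0; i < NUM; i++) {
            data[i] = (byte) (data2[i] + data3[i]);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[NUM];
        byte[] data2 = new byte[NUM];
        byte[] data3 = new byte[NUM];
        for (int i = 0; i < NUM; i++) {
            data2[i] = (byte) i;
            data3[i] = (byte) (2 * i);
        }
        // Repeat the call so the method becomes hot; the count is arbitrary.
        for (int iter = 0; iter < 100_000; iter++) {
            add(data, data2, data3);
        }
        System.out.println("data[5] = " + data[5]);
    }
}
```

Inspecting the generated code additionally needs -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly and the hsdis disassembler library.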

Regards,
Vivek

Previous code:
  0x00007f833c8691d6:   vmovdqu32 0x10(%rsi,%r14,1),%zmm3
  0x00007f833c8691e1:   vpaddb 0x10(%rdx,%r14,1),%zmm3,%zmm3
  0x00007f833c8691ec:   vmovdqu32 %zmm3,0x10(%rcx,%r14,1)
  0x00007f833c8691f7:   vmovdqu32 0x50(%rsi,%rbp,1),%zmm3
  0x00007f833c869202:   vpaddb 0x50(%rdx,%rbp,1),%zmm3,%zmm3
  0x00007f833c86920d:   vmovdqu32 %zmm3,0x50(%rcx,%rbp,1)
  0x00007f833c869218:   vmovdqu32 0x90(%rsi,%rbp,1),%zmm3
  0x00007f833c869223:   vpaddb 0x90(%rdx,%rbp,1),%zmm3,%zmm3
  0x00007f833c86922e:   vmovdqu32 %zmm3,0x90(%rcx,%rbp,1)
  0x00007f833c869239:   vmovdqu32 0xd0(%rsi,%rbp,1),%zmm3
  0x00007f833c869244:   vpaddb 0xd0(%rdx,%rbp,1),%zmm3,%zmm3
  0x00007f833c86924f:   vmovdqu32 %zmm3,0xd0(%rcx,%rbp,1)

After applying the patch:
  0x00007f833c8660a3:   vmovdqu 0x10(%rdi,%r11,1),%ymm0
  0x00007f833c8660aa:   vpaddb 0x10(%rsi,%r11,1),%ymm0,%ymm0
  0x00007f833c8660b1:   vmovdqu %ymm0,0x10(%rdx,%r11,1)

-----Original Message-----
From: Jie Fu [mailto:fujie at loongson.cn] 
Sent: Monday, August 12, 2019 2:27 AM
To: Vladimir Kozlov <vladimir.kozlov at oracle.com>; hotspot-compiler-dev at openjdk.java.net
Cc: Deshpande, Vivek R <vivek.r.deshpande at intel.com>
Subject: Re: RFR: 8227505: SuperWordLoopUnrollAnalysis may lead to over loop unrolling

Hi Vladimir and all,

Updated: http://cr.openjdk.java.net/~jiefu/8227505/webrev.02/

*Analysis*
The performance drop is caused by over-unrolling in SuperWordLoopUnrollAnalysis, which does not consider the negative effect of the pre-/post-loops at all.

The following is the perf stat data for different loop unrolling factors.
Please note that the number of branches increased by ~47% (from 19,854,995,391 to 29,229,714,991) when the unroll-factor increased from 8 to 16.
The total number of instructions increased by ~57% (from 108,849,151,185 to 171,334,733,346), which was even worse.

perf stat for unroll-factor=8:
----------------------------------------------------------------------------
        5429.117030      task-clock (msec)         #    1.006 CPUs utilized
                620      context-switches          #    0.114 K/sec
                 11      cpu-migrations            #    0.002 K/sec
             41,905      page-faults               #    0.008 M/sec
     24,176,919,686      cycles                    #    4.453 GHz
    108,849,151,185      instructions              #    4.50  insn per cycle
     19,854,995,391      branches                  # 3657.132 M/sec
         17,788,819      branch-misses             #    0.09% of all branches

        5.396099347 seconds time elapsed
----------------------------------------------------------------------------

perf stat for unroll-factor=16:
----------------------------------------------------------------------------
        9158.323771      task-clock (msec)         #    1.005 CPUs utilized
                763      context-switches          #    0.083 K/sec
                 16      cpu-migrations            #    0.002 K/sec
             41,884      page-faults               #    0.005 M/sec
     40,831,102,837      cycles                    #    4.458 GHz
    171,334,733,346      instructions              #    4.20  insn per cycle
     29,229,714,991      branches                  # 3191.601 M/sec
         16,455,010      branch-misses             #    0.06% of all branches

        9.115309970 seconds time elapsed
----------------------------------------------------------------------------
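
The relative increases can be rechecked directly from the two perf stat dumps. The sketch below only redoes that arithmetic; the class and method names are hypothetical, and the counter values are copied from the output above.

```java
// Recomputes the relative increase of two perf counters between the runs.
final class PerfStatDelta {
    // Percentage increase going from 'before' to 'after'.
    static double percentIncrease(long before, long after) {
        return 100.0 * (after - before) / before;
    }

    public static void main(String[] args) {
        // Counter values copied from the perf stat output for unroll-factor 8 and 16.
        long branches8 = 19_854_995_391L;
        long branches16 = 29_229_714_991L;
        long insns8 = 108_849_151_185L;
        long insns16 = 171_334_733_346L;

        System.out.printf("branches:     +%.1f%%%n", percentIncrease(branches8, branches16));
        System.out.printf("instructions: +%.1f%%%n", percentIncrease(insns8, insns16));
    }
}
```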

The increase in branches and total instructions was mainly introduced by the pre- and post-loops.
1) A higher unroll-factor may lead to more iterations in the pre-loop due to the alignment requirement of the main-loop.
    For example, with unroll-factor=16, 16 iterations may be required in the pre-loop since 16-byte vector instructions were used in the main-loop.
    With unroll-factor=8, however, no more than 8 iterations are required.
2) A higher unroll-factor may lead to more iterations in the post-loop, since its iteration count lies in the range [0, unroll-factor - 1].

As for this particular case, the distribution of iterations across the
pre-/main-/post-loops seems to be:
-----------------------------------------------------------------------
           | pre-lp iters | main-lp iters | post-lp iters | total iters
-----------------------------------------------------------------------
unroll(8)  |      8       |       6       |        5      |     19
-----------------------------------------------------------------------
unroll(16) |     16       |       2       |       13      |     31
-----------------------------------------------------------------------
So unrolling by 16 is harmful here.
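
The table can be reproduced with a simple model. The sketch below is an illustration under stated assumptions, not HotSpot code: it assumes the pre-loop runs exactly unroll-factor scalar iterations, and it uses a trip count of 61, which is inferred from the table rows (8 + 6*8 + 5 = 16 + 2*16 + 13 = 61) rather than stated in this mail. All names are hypothetical.

```java
// Models how a fixed trip count splits into pre-/main-/post-loop iterations.
final class LoopIterationSplit {
    // Returns {pre, main, post, total} iteration counts.
    static int[] split(int tripCount, int unrollFactor) {
        // Assumption: the pre-loop runs unroll-factor scalar iterations
        // (capped by the trip count) to reach the main-loop alignment.
        int pre = Math.min(unrollFactor, tripCount);
        int remaining = tripCount - pre;
        // Each main-loop iteration processes unrollFactor elements.
        int mainIters = remaining / unrollFactor;
        // Leftover elements are handled one at a time in the post-loop.
        int post = remaining % unrollFactor;
        return new int[] {pre, mainIters, post, pre + mainIters + post};
    }

    public static void main(String[] args) {
        int tripCount = 61; // inferred from the table, not given in the mail
        for (int unroll : new int[] {8, 16}) {
            int[] r = split(tripCount, unroll);
            System.out.printf("unroll(%d): pre=%d main=%d post=%d total=%d%n",
                    unroll, r[0], r[1], r[2], r[3]);
        }
    }
}
```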

*Fix*
A check based on the loop body size seems unable to detect this case.
When the VM decides whether to unroll by 16, the loop body size is just 64, which seems quite reasonable for SuperWordLoopUnrollAnalysis.
The generated loop body in assembly is also small enough when unrolled by 16.
----------------------------------------------------------------------------
  ;; B18: #      out( B18 B19 ) <- in( B17 B18 ) Loop( B18-B18 inner main of N59) Freq: 891835
   0x00007f62187bfba0:   movslq %r8d,%r11
   0x00007f62187bfba3:   vmovdqu 0x10(%r9,%r11,1),%xmm0
   0x00007f62187bfbaa:   vmovdqu %xmm0,0x10(%rsi,%r11,1) ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - TestSuperWordOverunrolling::execute@56 (line 21)
   0x00007f62187bfbb1:   add    $0x10,%r8d                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - TestSuperWordOverunrolling::execute@57 (line 20)
   0x00007f62187bfbb5:   cmp    $0x2f,%r8d
   0x00007f62187bfbb9:   jl     0x00007f62187bfba0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - TestSuperWordOverunrolling::execute@38 (line 20)
----------------------------------------------------------------------------

To fix it, the possible negative effect of the pre-/post-loops should be considered.
Unrolling may improve performance if it decreases the total number of iterations of the pre-/main-/post-loops.
However, the precise number of iterations in the pre-/post-loops is really hard to predict, since it depends on many factors such as the alignment requirement, the number and type of data elements, the object layout, and the allocated addresses.

To simplify the problem, the number of pre- and post-loop iterations is simply assumed to be the same as the unroll-factor.
A heuristic is introduced to protect against over-unrolling with SuperWordLoopUnrollAnalysis:
   - Assume the unroll-factor is x and the main-loop iteration count is y in the previous unrolling round; then the total number of iterations of the pre-/main-/post-loops is y + x.
   - In the next round, the unroll-factor becomes 2x and the main-loop iteration count becomes y/2; then the total number of iterations of the pre-/main-/post-loops is y/2 + 2x.
   - We'd better not unroll if y/2 + 2x > y + x, that is, if 2x > y.
----------------------------------------------------------------------------
           | unroll-factor | main-lp iters | pre&post-lp iters | total iters
----------------------------------------------------------------------------
pre-round  |       x       |       y       |        x          | y + x
----------------------------------------------------------------------------
next-round |      2x       |      y/2      |       2x          | y/2 + 2x
----------------------------------------------------------------------------
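
The stop condition can be written as a one-line predicate. The sketch below is only an illustration of the inequality above; it is not the code in the webrev, and the class, method, and parameter names are hypothetical.

```java
// Illustrates the over-unrolling guard: stop doubling the unroll factor when 2x > y.
final class UnrollHeuristicSketch {
    // x: unroll factor of the previous round.
    // y: main-loop iteration count of the previous round.
    // Estimated totals: y + x now, y/2 + 2x after one more doubling.
    // y/2 + 2x > y + x simplifies to 2x > y.
    static boolean shouldStopUnrolling(int x, int y) {
        return 2L * x > y; // long arithmetic avoids int overflow for large x
    }

    public static void main(String[] args) {
        // Values from the iteration table: unroll factor 8 with 6 main-loop iterations.
        System.out.println(shouldStopUnrolling(8, 6));
        // A long-running loop: unroll factor 8 with 1000 main-loop iterations.
        System.out.println(shouldStopUnrolling(8, 1000));
    }
}
```

With the values from the iteration table (x = 8, y = 6), the predicate holds, so the model advises against going from 8 to 16; for a loop with many main-loop iterations it does not hold and unrolling continues.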

*Testing*
No performance regression in SPECjvm2008

Any comments?

Thanks a lot.
Best regards,
Jie

On 2019/8/7 5:58 AM, Vladimir Kozlov wrote:
> Hi Jie
>
> Very interesting observation. I am concerned that webrev.01 does the check 
> for general loops, which may not be vectorized. Even if your 
> optimization helps in this particular case, it may make some loops regress 
> due to executing more branches.
>
> On 7/11/19 1:20 AM, Jie Fu wrote:
>> Hi all,
>>
>> With more experiments, the loop's trip_count seems to be a good feature for 
>> detecting over loop unrolling.
>> And on some platforms, the branch-miss rate has been observed to 
>> increase dramatically with a small loop trip count.
>
> Why? With more unrolling you should have fewer branches.
>
>> It seems that we shouldn't unroll if the trip count becomes too small.
>
> Maybe there is a different explanation for this. Maybe a big loop body 
> does not fit into the code buffer in the x86 CPU, or something like that. And 
> we should watch the body size instead.
>
> Thanks,
> Vladimir
>
>>
>> I've updated the webrev here: 
>> http://cr.openjdk.java.net/~jiefu/8227505/webrev.01/
>>
>> Please review it and give me some advice.
>>
>> Thanks a lot.
>> Best regards,
>> Jie
>>
>> On 2019/7/10 4:38 PM, Jie Fu wrote:
>>> Hi all,
>>>
>>> JBS:    https://bugs.openjdk.java.net/browse/JDK-8227505
>>> Webrev: http://cr.openjdk.java.net/~jiefu/8227505/webrev.00/
>>>
>>> The patch fixes the over loop unrolling problem caused by 
>>> SuperWordLoopUnrollAnalysis.
>>> For more information, please refer to the JBS.
>>>
>>> Could you please review it and give me some advice?
>>>
>>> Thanks a lot.
>>> Best regards,
>>> Jie
>>>
>>>
>>