RFR(L): 8186027: C2: loop strip mining

Wed Nov 22 19:58:09 UTC 2017

Thank you, Roland

I also ran pre-integration testing again and I see that testing with 
-Xcomp took longer with this fix than without it.

I am running the testing again. But if this repeats, then together with 
the Sparse.small regression it suggests to me that maybe we should keep 
this optimization off by default - keep UseCountedLoopSafepoints false.

We may switch it on later with additional changes which address the regressions.

What do you think?

Thanks,
Vladimir

On 11/22/17 6:57 AM, Roland Westrelin wrote:
> 
>> You answered all my comments.
>> I will run testing and push it if testing is clean.
> 
> Thanks for the review and taking care of sponsoring.
> 
> On JBS, you asked about a performance regression:
> SPECjvm2008-Sparse.small-G1 shows a significant regression - about 10%.
> 
> The generated code looks ok to me. Inner loop with strip mining off:
> 
>   0x00007f5d5b3698b0: mov    0x10(%r9,%rax,4),%r11d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 65 (line 63)
>   0x00007f5d5b3698b5: cmp    %r10d,%r11d
>   0x00007f5d5b3698b8: jae    0x00007f5d5b3698f1
>   0x00007f5d5b3698ba: vmovsd 0x10(%rcx,%rax,8),%xmm5  ;*daload {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 70 (line 63)
>   0x00007f5d5b3698c0: vmulsd 0x10(%r14,%r11,8),%xmm5,%xmm5
>   0x00007f5d5b3698c7: vaddsd %xmm5,%xmm4,%xmm4  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 72 (line 63)
>   0x00007f5d5b3698cb: inc    %eax               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 75 (line 62)
>   0x00007f5d5b3698cd: cmp    %esi,%eax
>   0x00007f5d5b3698cf: jl     0x00007f5d5b3698b0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
> 
> Inner loop + strip mined outer loop with strip mining:
> 
>   0x00007f0a02f27289: mov    %ebx,%ebp
>   0x00007f0a02f2728b: sub    %r9d,%ebp
>   0x00007f0a02f2728e: cmp    %esi,%ebp
>   0x00007f0a02f27290: cmovg  %esi,%ebp
>   0x00007f0a02f27293: add    %r9d,%ebp
>   0x00007f0a02f27296: data16 nopw 0x0(%rax,%rax,1)  ;*dload {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 57 (line 63)
>   0x00007f0a02f272a0: vmovq  %xmm4,%r13
>   0x00007f0a02f272a5: mov    0x10(%r13,%r9,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 65 (line 63)
>   0x00007f0a02f272aa: cmp    %r11d,%r8d
>   0x00007f0a02f272ad: jae    0x00007f0a02f272df
>   0x00007f0a02f272af: vmovsd 0x10(%rcx,%r9,8),%xmm7  ;*daload {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 70 (line 63)
>   0x00007f0a02f272b6: vmovq  %xmm2,%r13
>   0x00007f0a02f272bb: vmulsd 0x10(%r13,%r8,8),%xmm7,%xmm7
>   0x00007f0a02f272c2: vaddsd %xmm7,%xmm6,%xmm6  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 72 (line 63)
>   0x00007f0a02f272c6: inc    %r9d               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 75 (line 62)
>   0x00007f0a02f272c9: cmp    %ebp,%r9d
>   0x00007f0a02f272cc: jl     0x00007f0a02f272a0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
>   0x00007f0a02f272ce: mov    0x58(%r15),%r8     ; ImmutableOopMap{rcx=Oop rdx=Oop xmm0=Oop xmm2=Oop xmm4=Oop [0]=Oop }
>                                                 ;*goto {reexecute=1 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
>   0x00007f0a02f272d2: test   %eax,(%r8)         ;*goto {reexecute=0 rethrow=0 return_oop=0}
>                                                 ; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
>                                                 ;   {poll}
>   0x00007f0a02f272d5: cmp    %ebx,%r9d
>   0x00007f0a02f272d8: jl     0x00007f0a02f27289
> 
> There are extra spill instructions for some reason but I think the
> bigger problem here is that the inner loop runs for a small number of
> iterations (~2) and so the outer strip mined loop is executed too
> often. I have an experimental patch that, for loops that execute a small
> number of iterations (from profiling), makes 2 copies: one with the
> outer strip mined loop, one without. The code picks one or the other at
> runtime based on the actual number of iterations so there's still an
> overhead. I didn't include it here because it doesn't make the
> regression go away entirely and it's extra complexity.
> 
> Roland.
>
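
For readers following the thread, the loop shapes being compared can be
sketched at the Java source level. This is only an illustration under
stated assumptions: it is not the C2 transformation itself and not
Roland's experimental patch. The class name StripMiningSketch, the
constants STRIP and SHORT_TRIP, and their values are hypothetical. The
loop body is modeled on the SparseCompRow::matmult inner loop shown in
the disassembly above.

```java
public class StripMiningSketch {
    // Hypothetical values, chosen for illustration only.
    static final int STRIP = 1000;     // iterations between safepoint polls
    static final int SHORT_TRIP = 8;   // below this, skip the outer loop

    // Inner loop with strip mining off: one counted loop, no poll inside.
    static double plain(double[] val, int[] col, double[] x, int start, int end) {
        double sum = 0.0;
        for (int i = start; i < end; i++) {
            sum += x[col[i]] * val[i];
        }
        return sum;
    }

    // Strip-mined shape: the outer loop computes a per-strip limit
    // (the sub/cmp/cmovg/add sequence in the listing), the inner loop
    // runs without a poll, and the poll would sit on the outer back edge.
    static double stripMined(double[] val, int[] col, double[] x,
                             int start, int end, int strip) {
        double sum = 0.0;
        int i = start;
        while (i < end) {
            int limit = i + Math.min(end - i, strip);
            for (; i < limit; i++) {
                sum += x[col[i]] * val[i];
            }
            // A safepoint poll would be emitted here by the JIT.
        }
        return sum;
    }

    // Two-copy idea: pick a version at runtime from the actual trip count.
    // The check itself is the remaining overhead for very short loops.
    static double twoVersions(double[] val, int[] col, double[] x, int start, int end) {
        if (end - start <= SHORT_TRIP) {
            return plain(val, col, x, start, end);
        }
        return stripMined(val, col, x, start, end, STRIP);
    }
}
```

With a trip count of about 2, as in Sparse.small, the strip-mined version
pays the limit computation and the outer back edge on nearly every entry
to the loop, which is consistent with the overhead described above.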