RFR(L): 8186027: C2: loop strip mining
Vladimir Kozlov
vladimir.kozlov at oracle.com
Wed Nov 22 19:58:09 UTC 2017
Thank you, Roland
I also ran pre-integration testing again and I see that testing with
-Xcomp took longer with this fix than without it.
I am running testing again. But if this repeats, then together with the
Sparse.small regression it suggests to me that maybe we should keep this
optimization off by default, i.e. keep UseCountedLoopSafepoints false.
We may switch it on later with additional changes that address the regressions.
What do you think?
Thanks,
Vladimir
On 11/22/17 6:57 AM, Roland Westrelin wrote:
>
>> You answered all my comments.
>> I will run testing and push it if testing is clean.
>
> Thanks for the review and taking care of sponsoring.
>
> On JBS, you asked about a performance regression:
> SPECjvm2008-Sparse.small-G1 shows a significant regression - about 10%.
>
> The generated code looks ok to me. Inner loop with strip mining off:
>
> 0x00007f5d5b3698b0: mov 0x10(%r9,%rax,4),%r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@65 (line 63)
> 0x00007f5d5b3698b5: cmp %r10d,%r11d
> 0x00007f5d5b3698b8: jae 0x00007f5d5b3698f1
> 0x00007f5d5b3698ba: vmovsd 0x10(%rcx,%rax,8),%xmm5 ;*daload {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@70 (line 63)
> 0x00007f5d5b3698c0: vmulsd 0x10(%r14,%r11,8),%xmm5,%xmm5
> 0x00007f5d5b3698c7: vaddsd %xmm5,%xmm4,%xmm4 ;*dadd {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@72 (line 63)
> 0x00007f5d5b3698cb: inc %eax ;*iinc {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@75 (line 62)
> 0x00007f5d5b3698cd: cmp %esi,%eax
> 0x00007f5d5b3698cf: jl 0x00007f5d5b3698b0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
>
> Inner loop + strip mined outer loop with strip mining:
>
> 0x00007f0a02f27289: mov %ebx,%ebp
> 0x00007f0a02f2728b: sub %r9d,%ebp
> 0x00007f0a02f2728e: cmp %esi,%ebp
> 0x00007f0a02f27290: cmovg %esi,%ebp
> 0x00007f0a02f27293: add %r9d,%ebp
> 0x00007f0a02f27296: data16 nopw 0x0(%rax,%rax,1) ;*dload {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@57 (line 63)
> 0x00007f0a02f272a0: vmovq %xmm4,%r13
> 0x00007f0a02f272a5: mov 0x10(%r13,%r9,4),%r8d ;*iaload {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@65 (line 63)
> 0x00007f0a02f272aa: cmp %r11d,%r8d
> 0x00007f0a02f272ad: jae 0x00007f0a02f272df
> 0x00007f0a02f272af: vmovsd 0x10(%rcx,%r9,8),%xmm7 ;*daload {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@70 (line 63)
> 0x00007f0a02f272b6: vmovq %xmm2,%r13
> 0x00007f0a02f272bb: vmulsd 0x10(%r13,%r8,8),%xmm7,%xmm7
> 0x00007f0a02f272c2: vaddsd %xmm7,%xmm6,%xmm6 ;*dadd {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@72 (line 63)
> 0x00007f0a02f272c6: inc %r9d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@75 (line 62)
> 0x00007f0a02f272c9: cmp %ebp,%r9d
> 0x00007f0a02f272cc: jl 0x00007f0a02f272a0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@78 (line 62)
> 0x00007f0a02f272ce: mov 0x58(%r15),%r8 ; ImmutableOopMap{rcx=Oop rdx=Oop xmm0=Oop xmm2=Oop xmm4=Oop [0]=Oop }
> ;*goto {reexecute=1 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@78 (line 62)
> 0x00007f0a02f272d2: test %eax,(%r8) ;*goto {reexecute=0 rethrow=0 return_oop=0}
> ; - spec.benchmarks.scimark.SparseCompRow::matmult@78 (line 62)
> ; {poll}
> 0x00007f0a02f272d5: cmp %ebx,%r9d
> 0x00007f0a02f272d8: jl 0x00007f0a02f27289
>
> There are extra spill instructions for some reason, but I think the
> bigger problem here is that the inner loop runs for a small number of
> iterations (~2), so the outer strip mined loop is executed too
> often. I have an experimental patch that, for loops that profiling
> shows execute a small number of iterations, makes 2 copies: one with
> the outer strip mined loop, one without. The code picks one or the
> other at runtime based on the actual number of iterations, so there's
> still an overhead. I didn't include it here because it doesn't make
> the regression go away entirely and it's extra complexity.
>
> Roland.
>
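
For readers following the two listings, here is a rough Java sketch of the
loop shapes being compared. It is not the C2 transformation itself, only a
hand-written analogue. The class name, method names, and the `strip` and
`shortTripThreshold` parameters are all hypothetical, and the loop body is a
guess at the sparse row dot product that the disassembly appears to come from.
The limit computation is meant to mirror the sub/cmp/cmovg/add sequence at the
top of the strip mined listing, assuming %esi holds the strip length.

```java
final class StripMiningSketch {

    // Plain counted loop, analogous to the "strip mining off" listing.
    static double rowDot(double[] val, int[] col, double[] x, int start, int end) {
        double sum = 0.0;
        for (int i = start; i < end; i++) {
            sum += x[col[i]] * val[i];
        }
        return sum;
    }

    // Hand-written analogue of the strip mined shape: an outer loop that
    // hands the inner loop at most `strip` iterations at a time.
    // `strip` is a hypothetical parameter, not a value from the thread.
    static double rowDotStripMined(double[] val, int[] col, double[] x,
                                   int start, int end, int strip) {
        if (strip <= 0) {
            throw new IllegalArgumentException("strip must be positive");
        }
        double sum = 0.0;
        int i = start;
        while (i < end) {
            // limit = i + min(end - i, strip), the arithmetic the
            // sub/cmp/cmovg/add instructions appear to perform.
            int limit = i + Math.min(end - i, strip);
            for (; i < limit; i++) {
                sum += x[col[i]] * val[i];
            }
            // In compiled code the safepoint poll would sit here, once per
            // strip instead of once per iteration. Plain Java source has no
            // way to express the poll, so this comment only marks the spot.
        }
        return sum;
    }

    // Sketch of the two-copy idea: choose a loop version at runtime from
    // the actual trip count. `shortTripThreshold` is hypothetical.
    static double rowDotTwoVersions(double[] val, int[] col, double[] x,
                                    int start, int end,
                                    int strip, int shortTripThreshold) {
        if (end - start <= shortTripThreshold) {
            return rowDot(val, col, x, start, end);
        }
        return rowDotStripMined(val, col, x, start, end, strip);
    }
}
```

With a trip count of about 2, the outer loop setup and exit test in
rowDotStripMined run once per row for only a couple of inner iterations,
which is the overhead described above. The runtime check in
rowDotTwoVersions is the remaining cost of the two-copy approach.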