RFR(L): 8186027: C2: loop strip mining
Roland Westrelin
rwestrel at redhat.com
Wed Nov 22 14:57:55 UTC 2017
> You answered all my comments.
> I will run testing and push it if testing is clean.
Thanks for the review and taking care of sponsoring.
On JBS, you asked about a performance regression:
SPECjvm2008-Sparse.small-G1 show significant regression - about 10%.
The generated code looks ok to me. Inner loop with strip mining off:
0x00007f5d5b3698b0: mov 0x10(%r9,%rax,4),%r11d ;*iaload {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 65 (line 63)
0x00007f5d5b3698b5: cmp %r10d,%r11d
0x00007f5d5b3698b8: jae 0x00007f5d5b3698f1
0x00007f5d5b3698ba: vmovsd 0x10(%rcx,%rax,8),%xmm5 ;*daload {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 70 (line 63)
0x00007f5d5b3698c0: vmulsd 0x10(%r14,%r11,8),%xmm5,%xmm5
0x00007f5d5b3698c7: vaddsd %xmm5,%xmm4,%xmm4 ;*dadd {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 72 (line 63)
0x00007f5d5b3698cb: inc %eax ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 75 (line 62)
0x00007f5d5b3698cd: cmp %esi,%eax
0x00007f5d5b3698cf: jl 0x00007f5d5b3698b0 ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
Inner loop plus outer strip mined loop, with strip mining on:
0x00007f0a02f27289: mov %ebx,%ebp
0x00007f0a02f2728b: sub %r9d,%ebp
0x00007f0a02f2728e: cmp %esi,%ebp
0x00007f0a02f27290: cmovg %esi,%ebp
0x00007f0a02f27293: add %r9d,%ebp
0x00007f0a02f27296: data16 nopw 0x0(%rax,%rax,1) ;*dload {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 57 (line 63)
0x00007f0a02f272a0: vmovq %xmm4,%r13
0x00007f0a02f272a5: mov 0x10(%r13,%r9,4),%r8d ;*iaload {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 65 (line 63)
0x00007f0a02f272aa: cmp %r11d,%r8d
0x00007f0a02f272ad: jae 0x00007f0a02f272df
0x00007f0a02f272af: vmovsd 0x10(%rcx,%r9,8),%xmm7 ;*daload {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 70 (line 63)
0x00007f0a02f272b6: vmovq %xmm2,%r13
0x00007f0a02f272bb: vmulsd 0x10(%r13,%r8,8),%xmm7,%xmm7
0x00007f0a02f272c2: vaddsd %xmm7,%xmm6,%xmm6 ;*dadd {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 72 (line 63)
0x00007f0a02f272c6: inc %r9d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 75 (line 62)
0x00007f0a02f272c9: cmp %ebp,%r9d
0x00007f0a02f272cc: jl 0x00007f0a02f272a0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
0x00007f0a02f272ce: mov 0x58(%r15),%r8 ; ImmutableOopMap{rcx=Oop rdx=Oop xmm0=Oop xmm2=Oop xmm4=Oop [0]=Oop }
;*goto {reexecute=1 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
0x00007f0a02f272d2: test %eax,(%r8) ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
; {poll}
0x00007f0a02f272d5: cmp %ebx,%r9d
0x00007f0a02f272d8: jl 0x00007f0a02f27289
There are extra spill instructions for some reason, but I think the
bigger problem here is that the inner loop runs for a small number of
iterations (~2), so the outer strip mined loop is executed too
often. I have an experimental patch that, for loops that profiling
shows execute a small number of iterations, makes 2 copies: one with
the outer strip mined loop, one without. The code picks one or the
other at runtime based on the actual number of iterations, so there's
still an overhead. I didn't include it here because it doesn't make
the regression go away entirely and it adds complexity.
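To make the shape of the two variants concrete, here is a rough
source-level sketch in Java. It is only an illustration: C2 performs
this transformation on its IR, not on source code, and this is not the
experimental patch. The class name, the STRIP and SHORT_TRIP constants,
and the poll() counter standing in for the safepoint poll are all
assumptions made for the example. The inner limit is computed as
i + min(to - i, STRIP), which mirrors the sub/cmp/cmovg/add sequence at
the head of the outer loop in the listing above.

```java
public class StripMiningSketch {
    // Hypothetical strip length, chosen for illustration only.
    static final int STRIP = 1000;
    // Hypothetical trip-count threshold for picking the plain copy.
    static final int SHORT_TRIP = 8;

    // Stand-in for the safepoint poll: just counts how often it would run.
    static long polls = 0;

    static void poll() {
        polls++;
    }

    // Copy without the outer strip mined loop: no poll inside the loop.
    static double plain(double[] val, int[] col, double[] x, int from, int to) {
        double sum = 0.0;
        for (int i = from; i < to; i++) {
            sum += x[col[i]] * val[i];
        }
        return sum;
    }

    // Copy with the outer strip mined loop: the inner loop runs at most
    // STRIP iterations, then the outer loop polls once.
    static double stripMined(double[] val, int[] col, double[] x, int from, int to) {
        double sum = 0.0;
        int i = from;
        while (i < to) {
            // Same computation as the outer loop header in the listing:
            // limit = i + min(to - i, STRIP)
            int innerLimit = i + Math.min(to - i, STRIP);
            for (; i < innerLimit; i++) {
                sum += x[col[i]] * val[i];
            }
            poll();
        }
        return sum;
    }

    // Runtime selection between the two copies based on the actual trip
    // count. The check itself is the overhead that remains.
    static double versioned(double[] val, int[] col, double[] x, int from, int to) {
        if (to - from < SHORT_TRIP) {
            return plain(val, col, x, from, to);
        }
        return stripMined(val, col, x, from, to);
    }
}
```

With a trip count of about 2, stripMined() runs the limit computation
and the poll once for every 2 iterations of useful work, which is the
cost described above. versioned() avoids that for short trips but still
pays for the trip count check on every entry.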
Roland.