RFR(L): 8186027: C2: loop strip mining

Wed Nov 22 14:57:55 UTC 2017

> You answered all my comments.
> I will run testing and push it if testing is clean.

Thanks for the review and taking care of sponsoring.

On JBS, you asked about a performance regression:
SPECjvm2008-Sparse.small-G1 shows a significant regression, about 10%.

The generated code looks ok to me. Inner loop with strip mining off:

 0x00007f5d5b3698b0: mov    0x10(%r9,%rax,4),%r11d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 65 (line 63)
 0x00007f5d5b3698b5: cmp    %r10d,%r11d
 0x00007f5d5b3698b8: jae    0x00007f5d5b3698f1
 0x00007f5d5b3698ba: vmovsd 0x10(%rcx,%rax,8),%xmm5  ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 70 (line 63)
 0x00007f5d5b3698c0: vmulsd 0x10(%r14,%r11,8),%xmm5,%xmm5
 0x00007f5d5b3698c7: vaddsd %xmm5,%xmm4,%xmm4  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 72 (line 63)
 0x00007f5d5b3698cb: inc    %eax               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 75 (line 62)
 0x00007f5d5b3698cd: cmp    %esi,%eax
 0x00007f5d5b3698cf: jl     0x00007f5d5b3698b0  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}

Inner loop plus outer strip mined loop, with strip mining on:

 0x00007f0a02f27289: mov    %ebx,%ebp
 0x00007f0a02f2728b: sub    %r9d,%ebp
 0x00007f0a02f2728e: cmp    %esi,%ebp
 0x00007f0a02f27290: cmovg  %esi,%ebp
 0x00007f0a02f27293: add    %r9d,%ebp
 0x00007f0a02f27296: data16 nopw 0x0(%rax,%rax,1)  ;*dload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 57 (line 63)
 0x00007f0a02f272a0: vmovq  %xmm4,%r13
 0x00007f0a02f272a5: mov    0x10(%r13,%r9,4),%r8d  ;*iaload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 65 (line 63)
 0x00007f0a02f272aa: cmp    %r11d,%r8d
 0x00007f0a02f272ad: jae    0x00007f0a02f272df
 0x00007f0a02f272af: vmovsd 0x10(%rcx,%r9,8),%xmm7  ;*daload {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 70 (line 63)
 0x00007f0a02f272b6: vmovq  %xmm2,%r13
 0x00007f0a02f272bb: vmulsd 0x10(%r13,%r8,8),%xmm7,%xmm7
 0x00007f0a02f272c2: vaddsd %xmm7,%xmm6,%xmm6  ;*dadd {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 72 (line 63)
 0x00007f0a02f272c6: inc    %r9d               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 75 (line 62)
 0x00007f0a02f272c9: cmp    %ebp,%r9d
 0x00007f0a02f272cc: jl     0x00007f0a02f272a0  ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
 0x00007f0a02f272ce: mov    0x58(%r15),%r8     ; ImmutableOopMap{rcx=Oop rdx=Oop xmm0=Oop xmm2=Oop xmm4=Oop [0]=Oop }
                                               ;*goto {reexecute=1 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
 0x00007f0a02f272d2: test   %eax,(%r8)         ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                               ; - spec.benchmarks.scimark.SparseCompRow::matmult at 78 (line 62)
                                               ;   {poll}
 0x00007f0a02f272d5: cmp    %ebx,%r9d
 0x00007f0a02f272d8: jl     0x00007f0a02f27289
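
To make the shape of the second listing easier to follow, here is a
rough Java-level sketch of what it computes. It is only an
illustration, not what C2 generates from: the class, the method name
rowDot, the STRIP constant and the polls counter are all made up for
the example, and the loop body is only loosely modelled on the
matmult inner loop.

```java
final class StripMinedShape {
    // Hypothetical strip length; plays the role of %esi in the listing above.
    static final int STRIP = 1000;

    // Counts simulated safepoint polls (the "test %eax,(%r8)" in the listing).
    static int polls = 0;

    // Dot product over entries from..to-1 of one sparse row (illustrative names).
    // Assumes 0 <= from, so "to - i" cannot overflow.
    static double rowDot(double[] val, int[] col, double[] x, int from, int to) {
        double sum = 0.0;
        int i = from;
        while (i < to) {                                  // outer strip mined loop
            int innerLimit = i + Math.min(to - i, STRIP); // sub / cmp / cmovg / add
            for (; i < innerLimit; i++) {                 // inner counted loop, no poll
                sum += x[col[i]] * val[i];
            }
            polls++;                                      // one poll per strip
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] val = {2.0, 3.0};
        int[] col = {0, 1};
        double[] x = {10.0, 100.0};
        System.out.println(rowDot(val, col, x, 0, 2) + " polls=" + polls);
    }
}
```

With a 2-entry row, the sketch pays the inner limit computation and
one poll for only two iterations of useful work.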

There are extra spill instructions for some reason (the vmovq moves
from %xmm4 and %xmm2 into %r13 in the inner loop), but I think the
bigger problem here is that the inner loop runs for a small number of
iterations (~2), so the outer strip mined loop is executed too
often. I have an experimental patch that, for loops that profiling
shows run for a small number of iterations, makes 2 copies: one with
the outer strip mined loop, one without. The code picks one or the
other at runtime based on the actual number of iterations, so there's
still some overhead. I didn't include it here because it doesn't make
the regression go away entirely and it adds complexity.
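
As a rough Java-level sketch of the two-copy idea (not the patch
itself): the class and method names, the STRIP and SHORT_TRIP
constants and the polls counter are all hypothetical, and the sketch
assumes the copy without the outer loop carries no poll at all, which
is a simplification.

```java
final class TwoCopyLoop {
    static final int STRIP = 1000;    // hypothetical strip length
    static final int SHORT_TRIP = 8;  // hypothetical cutoff for "few iterations"

    static int polls = 0;             // counts simulated safepoint polls

    // Assumes 0 <= from <= to, so "to - from" cannot overflow.
    static double rowDot(double[] val, int[] col, double[] x, int from, int to) {
        double sum = 0.0;
        if (to - from <= SHORT_TRIP) {
            // Copy 1: no outer strip mined loop. The runtime check above is
            // the overhead that remains even on this path.
            for (int i = from; i < to; i++) {
                sum += x[col[i]] * val[i];
            }
            return sum;
        }
        // Copy 2: inner loop wrapped in the outer strip mined loop.
        int i = from;
        while (i < to) {
            int innerLimit = i + Math.min(to - i, STRIP);
            for (; i < innerLimit; i++) {
                sum += x[col[i]] * val[i];
            }
            polls++;                  // one poll per strip
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] val = {2.0, 3.0};
        int[] col = {0, 1};
        double[] x = {10.0, 100.0};
        System.out.println(rowDot(val, col, x, 0, 2) + " polls=" + polls);
    }
}
```

Both copies compute the same result; only the short one skips the
per-strip setup and poll, at the cost of duplicated loop code.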

Roland.