RFR (XS): Enable UseCountedLoopSafepoints with Shenandoah

Tue Dec 20 11:02:11 UTC 2016

On Tuesday, 20.12.2016, at 11:57 +0100, Aleksey Shipilev wrote:
> Hi,
> 
> Since we care mostly about pause times, and not the raw throughput, it makes
> sense to enable safepoints in counted loops. This makes us much more
> responsive (as in, time-to-safepoint, TTSP, is lower) in many interesting
> scenarios.
> 
> Change:
>   http://cr.openjdk.java.net/~shade/shenandoah/counted-loops/webrev.01/
> 
> The easiest example that is present in any workload of interest is looping
> through a large array/ArrayList.
> 
> SPECjvm2008 throughput does appear affected where tight loops are present:
> 
> Benchmark                Mode  Cnt      Score    Error    Units
> 
> # -XX:-UseCountedLoopSafepoints
> Compiler.compiler       thrpt   30    217.169 ±  5.166  ops/min
> Compiler.sunflow        thrpt   30    473.940 ± 20.246  ops/min
> Compress.test           thrpt   15    647.552 ±  3.528  ops/min
> CryptoAes.test          thrpt   15     44.367 ±  2.402  ops/min
> CryptoRsa.test          thrpt   15   2066.495 ± 11.809  ops/min
> CryptoSignVerify.test   thrpt   15  10372.019 ± 50.713  ops/min
> Derby.test              thrpt   30    375.954 ± 13.539  ops/min
> MpegAudio.test          thrpt   15    197.299 ±  2.411  ops/min
> ScimarkFFT.large        thrpt   15     55.618 ±  0.142  ops/min
> ScimarkFFT.small        thrpt   15    664.370 ±  7.304  ops/min
> ScimarkLU.large         thrpt   15     14.767 ±  0.082  ops/min
> ScimarkLU.small         thrpt   15    926.435 ±  8.790  ops/min
> ScimarkMonteCarlo.test  thrpt   15   4508.333 ± 68.869  ops/min
> ScimarkSOR.large        thrpt   15     74.596 ±  0.052  ops/min
> ScimarkSOR.small        thrpt   15    466.186 ±  1.308  ops/min
> ScimarkSparse.large     thrpt   15     48.932 ± 11.991  ops/min
> ScimarkSparse.small     thrpt   15    360.907 ±  6.739  ops/min
> Serial.test             thrpt   30   8779.857 ± 77.717    ops/s
> Sunflow.test            thrpt   15    124.546 ±  2.110  ops/min
> XmlTransform.test       thrpt   20    429.422 ± 24.964  ops/min
> XmlValidation.test      thrpt   30    773.254 ±  8.561  ops/min
> 
> # -XX:+UseCountedLoopSafepoints
> Compiler.compiler       thrpt   20    213.199 ±  8.146  ops/min
> Compiler.sunflow        thrpt   27    486.745 ± 21.118  ops/min
> Compress.test           thrpt   15    637.303 ±  4.800  ops/min <---  -1.5%
> CryptoAes.test          thrpt   15     46.943 ±  0.345  ops/min
> CryptoRsa.test          thrpt   15   2042.072 ± 12.379  ops/min <---  -1.1%
> CryptoSignVerify.test   thrpt   15  10240.459 ± 63.095  ops/min
> Derby.test              thrpt   30    406.943 ± 12.625  ops/min
> MpegAudio.test          thrpt   15    193.173 ±  1.414  ops/min
> ScimarkFFT.large        thrpt   15     55.629 ±  0.104  ops/min
> ScimarkFFT.small        thrpt   15    669.153 ±  6.683  ops/min
> ScimarkLU.large         thrpt   15     13.510 ±  0.075  ops/min <---  -8.5%
> ScimarkLU.small         thrpt   15    581.737 ±  6.539  ops/min <--- -37.3%
> ScimarkMonteCarlo.test  thrpt   15   4485.049 ± 11.864  ops/min
> ScimarkSOR.large        thrpt   15     74.594 ±  0.045  ops/min
> ScimarkSOR.small        thrpt   15    421.046 ±  0.456  ops/min <---  -9.6%
> ScimarkSparse.large     thrpt   15     40.995 ±  0.283  ops/min
> ScimarkSparse.small     thrpt   15    319.079 ±  1.391  ops/min <--- -11.3%
> Serial.test             thrpt   30   8717.823 ± 81.147    ops/s
> Sunflow.test            thrpt   15    127.221 ±  1.578  ops/min
> XmlTransform.test       thrpt   20    445.762 ±  8.278  ops/min
> XmlValidation.test      thrpt   30    760.121 ±  9.963  ops/min
> 
> Note that the Scimark benchmarks are expected to regress that much: they do
> have very tight loops, and that's our problem: the TTSP times there are in
> the multi-second range! The difference is explained by different code
> generation. For example, in the most dramatic case, ScimarkLU.small:
> 
> With -XX:-UseCountedLoopSafepoints, the hottest loop uses AVX2 (vmovdqu and
> friends):
> 
> http://cr.openjdk.java.net/~shade/shenandoah/counted-loops/scimark-lu-shenandoah-minus.perfasm
> 
> With -XX:+UseCountedLoopSafepoints, the hottest loop uses AVX (vmovsd and
> friends):
> 
> http://cr.openjdk.java.net/~shade/shenandoah/counted-loops/scimark-lu-shenandoah-plus.perfasm
> 
> As such, I believe enabling this by default, and figuring out code quality
> issues as we go forward, is the sane tactic.

Yes. The regressions, especially in scimark.lu, are bad, but as you say,
the ones that regress are also the ones that show extreme TTSP.

The patch is OK with me. Folks who prefer raw throughput and can live
with multi-second pause times can still turn the option off :-)

In the long run, we should look at strip mining the loops.
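
To make the idea concrete, here is a minimal source-level sketch of what
strip mining means, using the array-summing loop mentioned above as the
example. It is only an illustration: the class name, the method names and
the STRIP size of 1024 are made up for this mail, and where the JIT
actually places safepoint polls is its own decision. The real
transformation would have to happen inside the compiler, not in user code.

```java
// Hypothetical illustration only: names and STRIP size are assumptions.
final class StripMiningSketch {
    // Assumed strip length; a real compiler would pick its own value.
    static final int STRIP = 1024;

    // Plain counted loop: the whole array is walked in one tight loop.
    static long sumPlain(int[] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Strip-mined form: an outer loop steps over strips, an inner counted
    // loop does the work for one strip. The intent is that a safepoint
    // check is only needed once per strip (on the outer loop), while the
    // inner loop stays tight.
    static long sumStripMined(int[] a) {
        long sum = 0;
        int start = 0;
        while (start < a.length) {
            // end never exceeds a.length, so this cannot overflow.
            int end = start + Math.min(STRIP, a.length - start);
            for (int i = start; i < end; i++) {
                sum += a[i];
            }
            start = end;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[10_000_000];
        java.util.Arrays.fill(data, 1);
        System.out.println("sums match: " + (sumPlain(data) == sumStripMined(data)));
    }
}
```

Both methods compute the same result; the only difference is the loop
shape, which bounds the amount of work done between two iterations of the
outer loop.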

Roman