run-to-run variance on C/P/N/Q experiments

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Oct 9 03:41:56 PDT 2012


Yes, turned off.

In fact, the configuration is the same as [1], namely:
decomposition benchmark on 2x8x2 Xeon E5-2680 (SandyBridge) running
Solaris 11, and 20120925 lambda nightly with -d64 -XX:-TieredCompilation
-XX:+UseParallelOldGC -XX:+UseNUMA -XX:-UseBiasedLocking
-XX:+UseCondCardMark.
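
For concreteness, an invocation matching those flags might look like
the sketch below; the benchmark jar name is an assumption for
illustration, only the JVM flags come from this message:

```shell
# Hypothetical command line assembled from the flags quoted above;
# "decomposition-bench.jar" is a made-up name, not the actual harness.
JVM_FLAGS="-d64 -XX:-TieredCompilation -XX:+UseParallelOldGC \
-XX:+UseNUMA -XX:-UseBiasedLocking -XX:+UseCondCardMark"
echo java $JVM_FLAGS -jar decomposition-bench.jar
```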

And the difference is sustained throughout the run (a drifting C1
profile could explain that in tiered mode, but tiered compilation is
not enabled in this particular case).

-Aleksey.

On 10/09/2012 02:34 PM, Remi Forax wrote:
> Aleksey,
> is it with tiered compilation enabled or not?
> 
> I've found that tiered compilation introduces more jitter than when the 
> VM is configured to use only C2.
> 
> Rémi
> 
> On 10/09/2012 11:18 AM, Aleksey Shipilev wrote:
>> Hi,
>>
>> I'm following up on the decomposition experiments, this time focusing
>> on their run-to-run variance. I took one of the break-even points of
>> the previous experiment on the same machine [1], and executed it
>> multiple times.
>>
>> For C=1, P=32, N=3000, Q=20 in the parallel case, we run the tests in two modes:
>>    a. 10 iterations per JVM invocation, 1000 JVM runs [2]
>>    b. 100 iterations per JVM invocation, 10 JVM runs [3]
>>
>> The bottom line for this experiment is that we see a huge run-to-run
>> variance, which we triaged to JIT compilation jitter:
>>    - scores drift from run to run, but stay within tight bounds within a run
>>    - -Xint eliminates the variance (with a huge penalty in scores)
>>    - -Xcomp -Xbatch mitigates the variance (but also drops the scores)
>>
>> That also means that our break-even experiments are some 30-50% off
>> the true value. We found no reasonable way to lower the run-to-run
>> variance without a performance penalty, so the only option left at
>> this point is to run with multiple JVM invocations.
>>
>> The disassembly dumps captured for a low-score and a high-score run
>> are here [4]. The integer there is the throughput we measured on that
>> code. If someone can make sense of those logs alone, you are welcome
>> to try. The entry point for the microbenchmark is the "testParallel"
>> method. The inline trees are somewhat different, but not different
>> enough to readily explain the performance gap.
>>
>> -Aleksey.
>>
>> [1]
>> http://mail.openjdk.java.net/pipermail/lambda-dev/2012-October/006088.html
>> [2] http://shipilev.net/pub/jdk/lambda/runtorun-variance/i10-f1000/
>> [3] http://shipilev.net/pub/jdk/lambda/runtorun-variance/i100-f10/
>> [4] http://shipilev.net/pub/jdk/lambda/runtorun-variance/i10-f1000/asms/
>>
> 
> 
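
The multiple-invocations approach mentioned above (one score per fresh
JVM fork, aggregated across forks) can be sketched as follows; the
scores file and values here are made up for illustration, not taken
from the actual experiment data:

```shell
# Sketch of aggregating per-fork benchmark scores: each JVM fork emits
# one score line, and aggregating across forks turns JIT nondeterminism
# into visible spread rather than a bias in a single run. A real run
# would produce scores.txt via something like
#   for i in $(seq 1 1000); do java $JVM_FLAGS -jar bench.jar >> scores.txt; done
printf '100\n130\n115\n' > scores.txt   # hypothetical per-fork scores
sort -n scores.txt |
  awk '{s+=$1; a[NR]=$1} END {printf "mean:%.0f min:%d max:%d\n", s/NR, a[1], a[NR]}'
```

With the fake scores above this prints `mean:115 min:100 max:130`; the
min-to-max spread is the run-to-run variance being discussed.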
