RFR (XS): Enable UseCountedLoopSafepoints with Shenandoah

Tue Dec 20 17:56:17 UTC 2016

On 12/20/2016 05:52 PM, Andrew Haley wrote:
> On 20/12/16 14:01, Roman Kennke wrote:
>> On Tuesday, 20.12.2016, 13:32 +0000, Andrew Haley wrote:
>>> On 20/12/16 10:57, Aleksey Shipilev wrote:
>>>> Since we care mostly about pause times, and not the raw throughput, it makes
>>>> sense to enable safepoints in counted loops. This makes us much more responsive
>>>> (as in, TTSP is lower) in many interesting scenarios.
>>>
>>> True, but I have seen some very interesting cases where we beat G1 in
>>> throughput.
>>
>> Yes. As far as I can see, those are not affected by this (e.g. compiler
>> benchmarks). And multiple seconds (!) just to get to a safepoint seems
>> way too much, and it's more than 1 program that is affected by this.
> 
> Can you tell me which program delays so long?  I'd like to see it.
> 
> I suspect that's a bug.  And, of course, people are capable of using
> -XX:-UseCountedLoopSafepoints themselves.

This is not a bug, it is a well-known HotSpot issue:
  http://psy-lob-saw.blogspot.de/2015/12/safepoints.html
  http://psy-lob-saw.blogspot.de/2016/02/wait-for-it-counteduncounted-loops.html

If you want a contrived example, here's one:

http://icedtea.classpath.org/people/shade/gc-bench/file/5b77fb55a8b6/src/main/java/org/openjdk/gcbench/yield/ArrayIteration.java

With a 100M array, on my high-end i7, we see 300ms TTSP, which completely
dominates Shenandoah pause time. With safepoints in the loop, TTSP is down to 1-5ms.
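
For readers who do not want to open the benchmark, here is a minimal sketch of
the loop shape in question. It is not the code from ArrayIteration.java; the
class name, method name and default array size are made up for illustration.
The loop has an int induction variable, a constant stride and a loop-invariant
bound, which is the shape C2 treats as a counted loop and, without
-XX:+UseCountedLoopSafepoints, normally compiles without a safepoint poll.

```java
import java.util.Arrays;

public class CountedLoopSketch {

    // Hypothetical helper: sums the array with a plain int-indexed loop.
    // int index, stride 1, bound fixed for the whole loop: a counted loop.
    static long sum(int[] data) {
        long acc = 0;
        for (int i = 0; i < data.length; i++) {
            acc += data[i];
        }
        return acc;
    }

    public static void main(String[] args) {
        // Size is an assumption; pass a larger value (and enough heap)
        // to make the loop run long enough to matter for TTSP.
        int size = args.length > 0 ? Integer.parseInt(args[0]) : 10_000_000;
        int[] data = new int[size];
        Arrays.fill(data, 1);
        System.out.println(sum(data));
    }
}
```

Running the same class with and without -XX:+UseCountedLoopSafepoints while
something else requests safepoints is one way to see the difference; the
sketch itself only shows the loop shape and does not measure anything.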

Another one:
 http://icedtea.classpath.org/people/shade/gc-bench/file/4c32eb6c67b0/src/main/java/org/openjdk/gcbench/yield/MonteCarloPI.java

With 100M samples, one MonteCarlo run takes 1s, and that's the TTSP on my desktop
as well. With safepoints in the loop, TTSP is down to 1-5ms.

Another one:

http://icedtea.classpath.org/people/shade/gc-bench/file/5b77fb55a8b6/src/main/java/org/openjdk/gcbench/fragger/LinkedListFragger.java

If you do LinkedList.get(index), it runs a counted loop internally that steps
r -> r.next N times. But since the whole thing is cache-hostile, you have a
problem. On a large machine with 32 slow cores and slow memory, TTSPs are in the
1+ second range.
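
To make the access pattern concrete, here is a rough sketch of that kind of
traversal. It is a hand-written stand-in, not the JDK's LinkedList source and
not the LinkedListFragger benchmark; Node, build and get are hypothetical
names. The point is that the loop is counted (int index, fixed bound), while
every iteration is a dependent load that can miss the cache.

```java
public class NodeWalkSketch {

    // Hypothetical singly linked node, for illustration only.
    static final class Node {
        final int value;
        Node next;

        Node(int value) {
            this.value = value;
        }
    }

    // Builds a list holding 0, 1, ..., n-1 and returns its head.
    static Node build(int n) {
        Node head = null;
        for (int i = n - 1; i >= 0; i--) {
            Node node = new Node(i);
            node.next = head;
            head = node;
        }
        return head;
    }

    // Counted loop: steps r -> r.next exactly 'index' times.
    // Each step depends on the previous load, so a fragmented heap
    // turns this into a chain of cache misses with no poll in between.
    static Node get(Node head, int index) {
        Node r = head;
        for (int i = 0; i < index; i++) {
            r = r.next;
        }
        return r;
    }

    public static void main(String[] args) {
        Node head = build(1_000_000);
        System.out.println(get(head, 999_999).value);
    }
}
```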

This completely blows "ultra low pause" targets.

There is an alternative solution: loop mining, i.e. replacing one big loop with
two nested loops, and safepointing the outer one. This requires heavy changes in
C2. Roland wanted to take this on after the Xmas break.
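
As a source-level illustration of that idea (often called loop strip mining),
the sketch below splits one big loop into an outer loop over chunks and an
inner counted loop over one chunk. This is only the shape of the
transformation: the real thing would be done by C2 on its own IR, not in Java
source, and CHUNK, the class name and the method name are assumptions picked
for the example. The comment marks where the safepoint poll would conceptually
go.

```java
public class StripMiningSketch {

    // Hypothetical chunk size; the value is an assumption for illustration.
    static final int CHUNK = 1000;

    static long sumStripMined(int[] data) {
        long acc = 0;
        int n = data.length;
        int start = 0;
        while (start < n) {
            // n - start is positive here, so this addition cannot overflow.
            int end = start + Math.min(CHUNK, n - start);

            // Inner loop: at most CHUNK iterations, no poll needed inside.
            for (int i = start; i < end; i++) {
                acc += data[i];
            }
            start = end;

            // Outer loop back-edge: this is where the safepoint poll would
            // conceptually sit, so it is reached once per CHUNK iterations.
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = new int[2500];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        System.out.println(sumStripMined(data));
    }
}
```

The trade-off is the chunk size: a larger chunk keeps the inner loop closer to
the unpolled version, a smaller one bounds the time between polls more
tightly.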

Thanks,
-Aleksey