RFR(XL): 8203469: Faster safepoints

Robbin Ehn robbin.ehn at oracle.com
Wed Jan 16 09:45:50 UTC 2019


Thanks Aleksey!

Great that you are seeing such an improvement, and thanks for the build tests.

I added a new logging line in this patch, e.g.:
[93.734s][info][safepoint,stats] Deoptimize: Synchronization: 4899 Operation: 457865 - Total: 462764 Application: 8270801 (ns)

This might be useful (it is for me); let me know if it needs reformatting, is 
missing information, or is simply useless.
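
Something like this trivial standalone parser (just a sketch to spell out the
fields, not part of the patch; class and variable names are made up) reads the
message part of that line. Note that in the example Total is Synchronization
plus Operation: 4899 + 457865 = 462764.

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class SafepointStatsParser {
      // Matches the message part of the new safepoint,stats line, e.g.
      // "Deoptimize: Synchronization: 4899 Operation: 457865 - Total: 462764 Application: 8270801 (ns)"
      private static final Pattern LINE = Pattern.compile(
          "(\\S+): Synchronization: (\\d+) Operation: (\\d+) - Total: (\\d+) Application: (\\d+) \\(ns\\)");

      public static void main(String[] args) {
          String msg = "Deoptimize: Synchronization: 4899 Operation: 457865"
                     + " - Total: 462764 Application: 8270801 (ns)";
          Matcher m = LINE.matcher(msg);
          if (m.matches()) {
              long sync  = Long.parseLong(m.group(2)); // synchronizing threads to the safepoint
              long op    = Long.parseLong(m.group(3)); // running the VM operation itself
              long total = Long.parseLong(m.group(4)); // sync + op: 4899 + 457865 = 462764
              long app   = Long.parseLong(m.group(5)); // the "Application" field of the line
              System.out.printf("%s: sync=%d op=%d total=%d app=%d (ns)%n",
                                m.group(1), sync, op, total, app);
          }
      }
  }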

We also don't measure the early prolog with:
  Universe::heap()->safepoint_synchronize_begin();

  // By getting the Threads_lock, we assure that no threads are about to start or
  // exit. It is released again in SafepointSynchronize::end().
  Threads_lock->lock();

I had a measurement there, but there was never any time spent in it; something 
to keep in mind at least.

/Robbin

On 2019-01-15 21:34, Aleksey Shipilev wrote:
> On 1/15/19 12:33 PM, Robbin Ehn wrote:
>> On 2019-01-15 11:48, Andrew Haley wrote:
>>> On 1/15/19 10:39 AM, Robbin Ehn wrote:
>>>> - On a 16-strand machine, synchronization and un-synchronization/starting are at
>>>>      least 3x faster (in a non-trivial test): synchronization ~600 -> ~100 us and
>>>>      starting ~400 -> ~100 us.
>>>
>>> Thanks. Could you share that benchmark? It'd be interesting to try it on
>>> other architectures.
>>>
>>
>> Those numbers are calculated from Linux perf data; I couldn't remember what I ran.
>> But apparently it was also SPECjvm2008 Serial with 16 benchmark threads.
>> The reason for Serial is that Aleksey tested a couple of previews with
>> Shenandoah and indicated that Serial was a good test.
>> (Aleksey, if you have time, please test again.)
> 
> Hey. I indeed remember testing an early version of the patch, discovering some Serial throughput
> hit, which Robbin then fixed. Unfortunately, SPECjvm does not run cleanly with current
> jdk/jdk. We have a Serial-like workload driven by JMH. Re-running it with current jdk/jdk first to
> get a bearing on possible throughput loss (i7-7820X, 16 threads, -Xms4g -Xmx4g):
> 
>   ZGC, baseline:               19779 ± 79 ops/s
>   ZGC, fast safepoints:        19858 ± 89 ops/s ; no loss!
>   Shenandoah, baseline:        19929 ± 57 ops/s
>   Shenandoah, fast safepoints: 19932 ± 49 ops/s ; no loss!
> 
> Excellent, both collectors indicate no throughput loss.
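> 
> For reference, a minimal sketch of the shape of such a Serial-like JMH workload
> (this is not the exact benchmark behind the numbers here; the payload and sizes
> are only illustrative, but the Serial.test name matches the result tables below):
> 
>   import java.io.ByteArrayOutputStream;
>   import java.io.IOException;
>   import java.io.ObjectOutputStream;
>   import org.openjdk.jmh.annotations.Benchmark;
>   import org.openjdk.jmh.annotations.Scope;
>   import org.openjdk.jmh.annotations.Setup;
>   import org.openjdk.jmh.annotations.State;
>   import org.openjdk.jmh.annotations.Threads;
> 
>   @State(Scope.Thread)
>   public class Serial {
> 
>       private int[] payload;
> 
>       @Setup
>       public void setup() {
>           payload = new int[128];
>       }
> 
>       @Benchmark
>       @Threads(16) // saturate the 16 hardware threads, as in the runs above
>       public byte[] test() throws IOException {
>           // Serialize a small object graph: allocation-heavy, so GC cycles
>           // (and thus safepoints) come often enough to expose TTSP costs.
>           ByteArrayOutputStream bos = new ByteArrayOutputStream();
>           try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
>               oos.writeObject(payload);
>           }
>           return bos.toByteArray();
>       }
>   }
> 
> Run with -Xms4g -Xmx4g and the collector under test; JMH consumes the returned
> byte[] implicitly, so no explicit Blackhole is needed.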
> 
> Now, there are two ways to measure the safepointing overhead. First, Shenandoah measures both "net"
> pause time (VM operation time, basically), and "gross" pause time (which includes both entering and
> leaving the safepoint, as well as VM operation time). With a longer Serial run to get more GC cycles
> going:
> 
>   # Shenandoah, baseline
>   [info][gc,stats] Total Pauses (G) = 0.47 s (a = 1139 us) (n = 416)
>                                              (lvls, us = 203, 381, 426, 2324, 4705)
>   [info][gc,stats] Total Pauses (N) = 0.09 s (a =  216 us) (n = 416)
>                                              (lvls, us =  27, 182, 203,  223, 1258)
> 
>   # Shenandoah, fast safepoints
>   [info][gc,stats] Total Pauses (G) = 0.21 s (a = 500 us) (n = 414)
>                                              (lvls, us = 191, 398, 418,  443, 8992)
>   [info][gc,stats] Total Pauses (N) = 0.10 s (a = 229 us) (n = 414)
>                                              (lvls, us =  26, 184, 217,  244,  2327)
> 
> Ta-da! Gross pause times were cut in half, which means that we used to spend ~3/4 ms in safepoint
> infra, and now we spend only ~1/4 ms.
> 
> The second way is to parse safepoint logs. JMH has -prof safepoints that does it automatically. (It
> also prints percentiles and event counts, omitted here for brevity):
> 
> Benchmark                               Mode   Cnt       Score    Error  Units
> 
> # ZGC, baseline
> Serial.test                            thrpt    10   19710.077 ± 96.073  ops/s
> Serial.test:·safepoints.pause          thrpt  1260    1310.022              ms
> Serial.test:·safepoints.pause.avg      thrpt             1.040              ms
> Serial.test:·safepoints.ttsp           thrpt  1260    1053.717              ms
> Serial.test:·safepoints.ttsp.avg       thrpt             0.836              ms
> 
> # ZGC, fast safepoints
> Serial.test                            thrpt    10   19904.687 ± 39.679  ops/s
> Serial.test:·safepoints.pause          thrpt  1275     291.520              ms
> Serial.test:·safepoints.pause.avg      thrpt             0.229              ms
> Serial.test:·safepoints.ttsp           thrpt  1275      92.065              ms
> Serial.test:·safepoints.ttsp.avg       thrpt             0.072              ms ; 11.6x lower!
> 
> # Shenandoah, baseline
> Serial.test                            thrpt   10   19874.508 ± 47.175  ops/s
> Serial.test:·safepoints.pause          thrpt  389     425.723              ms
> Serial.test:·safepoints.pause.avg      thrpt            1.094              ms
> Serial.test:·safepoints.ttsp           thrpt  389     312.413              ms
> Serial.test:·safepoints.ttsp.avg       thrpt            0.803              ms
> 
> # Shenandoah, fast safepoints
> Serial.test                            thrpt   10   19918.094 ± 48.058  ops/s
> Serial.test:·safepoints.pause          thrpt  387     159.440              ms
> Serial.test:·safepoints.pause.avg      thrpt            0.412              ms
> Serial.test:·safepoints.ttsp           thrpt  387      41.941              ms
> Serial.test:·safepoints.ttsp.avg       thrpt            0.108              ms ; 7.4x lower!
> 
> So, TTSP (time to safepoint) has improved majestically. The rest of the "gross" pause time measured by
> Shenandoah is quite probably the safepoint-leaving latency.
> 
> Also did spot build checks: the patch builds in fastdebug on aarch64, s390x, x86_32.
> 
> -Aleksey
> 

