About using SetPriorityClass to reduce reaching safepoint duration
dmytro sheyko
dmytro.sheyko.jdk at gmail.com
Wed May 13 18:06:57 UTC 2020
Hi David,
> Which JDK version? Do you have data on what the safepoints are i.e. all
> for GC or something else? Which GC?
The experiment was done with JDK 15; however, the original problem was
observed on 1.8.0_202 with CMS.
The GC logs look like this:
===
2020-05-11T17:09:26.222+0300: 45.772: Application time: 0.0171840 seconds
2020-05-11T17:09:26.222+0300: 45.772: [GC (CMS Final Remark) [YG occupancy:
2094 K (19648 K)]2020-05-11T17:09:26.222+0300: 45.772: [Rescan (parallel) ,
0.0010097 secs]2020-05-11T17:09:26.223+0300: 45.773: [weak refs
processing2020-05-11T17:09:26.223+0300: 45.773: [SoftReference, 0 refs,
0.0000082 secs]2020-05-11T17:09:26.223+0300: 45.773: [WeakReference, 0
refs, 0.0000070 secs]2020-05-11T17:09:26.223+0300: 45.773: [FinalReference,
14 refs, 0.0020968 secs]2020-05-11T17:09:26.225+0300: 45.775:
[PhantomReference, 0 refs, 0 refs, 0.0000123
secs]2020-05-11T17:09:26.225+0300: 45.775: [JNI Weak Reference, 0.0000648
secs], 0.0022370 secs]2020-05-11T17:09:26.225+0300: 45.775: [class
unloading, 0.0028190 secs]2020-05-11T17:09:26.228+0300: 45.778: [scrub
symbol table, 0.0042937 secs]2020-05-11T17:09:26.232+0300: 45.782: [scrub
string table, 0.0004963 secs][1 CMS-remark: 30194K(43712K)] 32289K(63360K),
0.0111991 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-05-11T17:09:26.233+0300: 45.783: Total time for which application
threads were stopped: 0.0112880 seconds, Stopping threads took: 0.0000237
seconds
2020-05-11T17:09:26.233+0300: 45.783: [CMS-concurrent-sweep-start]
2020-05-11T17:09:26.242+0300: 45.792: [CMS-concurrent-sweep: 0.009/0.009
secs] [Times: user=0.02 sys=0.00, real=0.01 secs]
2020-05-11T17:09:26.242+0300: 45.792: [CMS-concurrent-reset-start]
2020-05-11T17:09:26.242+0300: 45.792: [CMS-concurrent-reset: 0.000/0.000
secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
2020-05-11T17:09:26.245+0300: 45.795: Application time: 0.0113318 seconds
2020-05-11T17:09:33.602+0300: 53.152: Total time for which application
threads were stopped: 7.3572265 seconds, Stopping threads took: 7.3571745
seconds
2020-05-11T17:09:33.602+0300: 53.152: Application time: 0.0000331 seconds
2020-05-11T17:09:33.602+0300: 53.152: Total time for which application
threads were stopped: 0.0000399 seconds, Stopping threads took: 0.0000132
seconds
2020-05-11T17:09:33.602+0300: 53.152: Application time: 0.0000459 seconds
2020-05-11T17:09:33.602+0300: 53.152: Total time for which application
threads were stopped: 0.0000302 seconds, Stopping threads took: 0.0000089
seconds
2020-05-11T17:09:33.602+0300: 53.152: Application time: 0.0000340 seconds
2020-05-11T17:09:33.602+0300: 53.152: Total time for which application
threads were stopped: 0.0000279 seconds, Stopping threads took: 0.0000082
seconds
2020-05-11T17:09:33.602+0300: 53.152: Application time: 0.0000980 seconds
===
2020-05-11T17:09:17.814+0300: 37.185: Total time for which application
threads were stopped: 0.0100980 seconds, Stopping threads took: 0.0000201
seconds
2020-05-11T17:09:18.814+0300: 38.186: Application time: 1.0003422 seconds
2020-05-11T17:09:18.814+0300: 38.186: Total time for which application
threads were stopped: 0.0000706 seconds, Stopping threads took: 0.0000215
seconds
2020-05-11T17:09:19.815+0300: 39.186: Application time: 1.0003862 seconds
2020-05-11T17:09:19.815+0300: 39.186: Total time for which application
threads were stopped: 0.0000866 seconds, Stopping threads took: 0.0000403
seconds
2020-05-11T17:09:20.290+0300: 39.661: Application time: 0.4746837 seconds
2020-05-11T17:09:20.290+0300: 39.661: Total time for which application
threads were stopped: 0.0002432 seconds, Stopping threads took: 0.0000171
seconds
2020-05-11T17:09:20.307+0300: 39.679: Application time: 0.0175904 seconds
2020-05-11T17:09:20.308+0300: 39.679: Total time for which application
threads were stopped: 0.0002325 seconds, Stopping threads took: 0.0000196
seconds
2020-05-11T17:09:20.570+0300: 39.942: Application time: 0.2634484 seconds
2020-05-11T17:09:26.496+0300: 45.867: Total time for which application
threads were stopped: 5.9250744 seconds, Stopping threads took: 5.9247990
seconds
2020-05-11T17:09:26.496+0300: 45.867: Application time: 0.0000920 seconds
2020-05-11T17:09:26.496+0300: 45.868: Total time for which application
threads were stopped: 0.0000727 seconds, Stopping threads took: 0.0000111
seconds
2020-05-11T17:09:26.496+0300: 45.868: Application time: 0.0000702 seconds
===
Please note these two lines:
... Stopping threads took: 7.3571745 seconds
and
... Stopping threads took: 5.9247990 seconds
All the other pauses are short.
As far as I can tell, these safepoints are not related to garbage
collection.
The environment is Windows 10 on amd64, with 4 physical/logical processors
and about 10 JVMs running at the same time.
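As a side note, the "Stopping threads took" component can be attributed to a
concrete VM operation with HotSpot's own safepoint diagnostics. The flags
below are standard HotSpot options; `app.jar` is just a placeholder for the
application under test:

```shell
# JDK 8: per-safepoint statistics (VM operation name, spin/block/sync times)
java -XX:+PrintGCApplicationStoppedTime \
     -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1 \
     -jar app.jar

# JDK 9+: the unified-logging equivalent
java -Xlog:safepoint -jar app.jar
```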
> Messing with OS level thread priorities is a big "can of worms" and can
> lead to more scheduling aberrations than it cures. If you have an
> overloaded system with multiple JVMs running then you are already in a
> bad place. How do you decide which of those JVMs is most important? If
> you boost the priority of threads trying to reach safepoints then you
> skew the CPU allocation to those JVMs that execute a lot of safepoints
> and have a lot of threads. This may make those applications quite happy
> but are they more important than the other applications on the system?
> And you are not just impacting JVMs but all processes running on the
> machine. You'd need to investigate the performance impact on the system
> as a whole. And if different VMs have overlapping safepoints you just
> end up boosting them all. Even within a JVM the boosted GC threads would
> interfere with the boosted JavaThreads.
Let's look at this issue from another angle.
When multiple threads from different processes (or the same process)
compete for CPU time, it is up to the OS to schedule them fairly. However,
safepoints make things a bit uneven. When a given VM starts a safepoint and
one of its app threads sits at the end of the processor run queue, the
threads ahead of it in the queue (mostly from other processes) do not just
prevent that one thread from running; they block our whole VM, including
the threads that have already reached the safepoint. At this moment the CPU
allocation is actually skewed against the VM that is starting the
safepoint, simply because the threads that have already reached the
safepoint relinquish their CPU quantum and patiently wait for the others.
To me, this looks similar to priority inversion.
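To make the convoy effect concrete, here is a toy Java sketch (not HotSpot
code; the class name and the 200 ms delay are invented for illustration).
All "app threads" must rendezvous at a barrier before the "VM operation"
can run, so a single delayed thread gates the stop time for everyone:

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

public class SafepointConvoy {
    public static void main(String[] args) throws Exception {
        final int appThreads = 4;
        // All app threads plus the "VM thread" (main) must arrive before
        // the barrier opens, mimicking a safepoint rendezvous.
        CyclicBarrier safepoint = new CyclicBarrier(appThreads + 1);
        long start = System.nanoTime();
        for (int i = 0; i < appThreads; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    // Thread 0 is the straggler: pretend the scheduler
                    // left it at the end of the run queue for 200 ms.
                    if (id == 0) Thread.sleep(200);
                    safepoint.await(); // "reach the safepoint"
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        safepoint.await(); // the VM thread waits for all app threads
        long stopMs = (System.nanoTime() - start) / 1_000_000;
        // Time-to-safepoint is dominated by the single delayed thread,
        // even though the other three arrived almost immediately.
        System.out.println("Stopping threads took ~" + stopMs + " ms");
    }
}
```

The three prompt arrivals give up their quantum and idle behind the
straggler, which is exactly the skew described above.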
The priority boost is intended to be temporary and very short. After the
safepoint, the app threads will continue their work at their normal
priorities, without any preference over the others.
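The boost-then-restore pattern I have in mind can be sketched at the Java
level like this (illustration only: the actual change would call
SetThreadPriority/SetPriorityClass inside the VM on Windows, whereas
java.lang.Thread priorities are merely scheduling hints, and the class and
method names here are invented):

```java
public class BoostDuringStop {
    // Temporarily raise a thread's priority while it performs the
    // time-critical work (here: reaching the safepoint), then restore
    // the original priority so there is no lasting preference.
    static void runBoosted(Thread t, Runnable criticalWork) {
        int old = t.getPriority();
        try {
            t.setPriority(Thread.MAX_PRIORITY); // boost only while stopping
            criticalWork.run();
        } finally {
            t.setPriority(old); // restore: back to fair competition
        }
    }

    public static void main(String[] args) {
        Thread self = Thread.currentThread();
        runBoosted(self, () ->
            System.out.println("priority during stop: " + self.getPriority()));
        System.out.println("priority after: " + self.getPriority());
    }
}
```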
I agree that the performance impact on the system as a whole needs to be
investigated.
It also seems worth looking into something like inter-process
load-balancing thread pools.
Thank you,
Dmytro
More information about the hotspot-runtime-dev mailing list