Minor GCs consuming an extreme amount of user CPU time

Hohensee, Paul hohensee at amazon.com
Fri Nov 22 15:48:57 UTC 2024


-XX:+PrintGCDetails would help. And I'd upgrade to JDK 11 if you want to continue using CMS, but better would be to upgrade to JDK 17 or 21 and use G1 instead.
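For reference, a rough sketch of what the collector and logging setup might look like after such an upgrade (JDK 17/21, G1, unified logging). The heap size mirrors the current -Xmx and the log file name is just a placeholder, not a tuning recommendation:

    java -Xms150g -Xmx150g -XX:+UseG1GC \
         -Xlog:gc*,gc+ref=debug:file=gc.log:time,uptime \
         ...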

Given the very long pause times, I suspect, in order of likelihood: inter-socket memory accesses, actual swapping, or perhaps an OS pause linked to log writes. Viz.

> Memory: 4k page, physical 196374708k(167484808k free), swap
> 4194300k(3010032k free)

> -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600
> -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6

CMS wasn't designed for heaps larger than ~8gb (even with 24 physical cores), and this one is 153gb with a relatively small 2.7gb young gen, so heap configuration may be a factor. Free memory at JVM launch looks to be 167484808k and -Xmx is 157286400k, a difference of 10198408k. The JVM uses native memory for various things (e.g., the card table, which would be 307200k here), but that alone probably wouldn't use up all of the leftover ~10gb.
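For the record, the arithmetic behind those numbers (the HotSpot card table is one byte per 512 bytes of heap):

    -Xmx            161061273600 bytes = 157286400k (150g)
    free at launch  167484808k
    difference      167484808k - 157286400k = 10198408k (~9.7g)
    card table      161061273600 / 512 = 314572800 bytes = 307200k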

Your heap spans two sockets, and memory access latency between sockets is a multiple of intra-socket latency (I don't know the difference off-hand for this box, but it used to be a factor of 5 or more). The OS can dynamically move physical pages between sockets based on memory access patterns, so you might be doing GCs that are mostly cross-socket accesses until the OS figures out what's happening and migrates the pages. On multi-socket systems it's usually better to run one JVM per socket using numactl, which guarantees intra-socket memory accesses; see the sketch below.
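As an illustration (the node numbers match the lscpu output below, but the heap size and application arguments are hypothetical), one JVM per socket might look like:

    # bind each JVM's threads and memory to one NUMA node
    numactl --cpunodebind=0 --membind=0 java -Xmx70g ... &
    numactl --cpunodebind=1 --membind=1 java -Xmx70g ... &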

There may also be other processes running that use up memory and provoke swapping.

JVM logging is synchronous in JDK 8, which might indirectly cause long pauses due to the OS having to flush its I/O cache to flash or disk.
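If you do move to JDK 17+, unified logging can also be made asynchronous so slow log writes stay off the pause path; a minimal sketch (file name illustrative):

    java -Xlog:async -Xlog:gc*:file=gc.log:time,uptime ...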

Regards,
Paul

On 11/22/24, 12:51 AM, "hotspot-gc-use on behalf of Thomas Schatzl" <hotspot-gc-use-retn at openjdk.org on behalf of thomas.schatzl at oracle.com> wrote:

Hi,

I do not have a lot of experience with CMS, but the following could be
potential issues here:

* this is a burstable VM and it sporadically does not get CPU time. That
does not seem likely to me for such a large setup, though.

* I am not sure about CMS (it has really been a long time), but we
fairly recently fixed an issue in Parallel GC where large arrays of
references were not distributed well across threads, and CMS might have
the same issue.

I am not sure this would show up as all threads being busy for a long
time, though.

AFAIR there is not much logging in ParNew/CMS either
(-XX:+PrintGCDetails may help, though).

Maybe you know whether this can happen in your application.

A heap histogram taken during that time may also help.

* maybe parallel reference processing is not enabled. Use
-XX:+PrintReferenceGC to log reference processing times and
-XX:+ParallelRefProcEnabled to enable it.

Maybe there is still somebody else out there more knowledgeable about
CMS tuning/issues than me.

Hth,
Thomas


On 19.11.24 17:45, Shane Cox wrote:
> We are observing a transient behavior where Minor GCs consume an extreme
> amount of user CPU time. Below is an excerpt from a GC log illustrating
> this behavior. The first line is representative of typical behavior.
> Starting with the second line, we observe progressively longer Minor GC
> pause times and attendant user CPU usage. The last line shows GC
> behavior returning to normal.
>
> 2024-11-19T05:31:57.096-0500: 39910.586: [GC (Allocation Failure)
> 2024-11-19T05:31:57.096-0500: 39910.587: [ParNew:
> 2276397K->23593K(2530368K), 0.8039922 secs]
> 34156897K->32412975K(157005312K), 0.8042477 secs] [Times: user=26.09
> sys=0.28, real=0.81 secs]
>
> 2024-11-19T05:32:01.942-0500: 39915.432: [GC (Allocation Failure)
> 2024-11-19T05:32:01.942-0500: 39915.433: [ParNew:
> 2272873K->22588K(2530368K), 5.9797785 secs]
> 34662255K->32413062K(157005312K), 5.9800455 secs] [Times: user=195.82
> sys=0.16, real=5.98 secs]
>
> 2024-11-19T05:32:08.811-0500: 39922.302: [GC (Allocation Failure)
> 2024-11-19T05:32:08.812-0500: 39922.302: [ParNew:
> 2271868K->24057K(2530368K), 30.6785198 secs]
> 34662342K->32417812K(157005312K), 30.6788152 secs] [Times: user=1005.22
> sys=0.69, real=30.68 secs]
>
> 2024-11-19T05:32:40.245-0500: 39953.736: [GC (Allocation Failure)
> 2024-11-19T05:32:40.245-0500: 39953.736: [ParNew:
> 2273337K->41786K(2530368K), 48.4038572 secs]
> 34667092K->32436550K(157005312K), 48.4041764 secs] [Times: user=1586.51
> sys=0.85, real=48.41 secs]
>
> 2024-11-19T05:33:32.236-0500: 40005.726: [GC (Allocation Failure)
> 2024-11-19T05:33:32.236-0500: 40005.726: [ParNew:
> 2291066K->59232K(2530368K), 63.8453443 secs]
> 34685830K->32454926K(157005312K), 63.8456820 secs] [Times: user=2091.55
> sys=1.66, real=63.85 secs]
>
> 2024-11-19T05:34:36.698-0500: 40070.189: [GC (Allocation Failure)
> 2024-11-19T05:34:36.698-0500: 40070.189: [ParNew:
> 2308512K->74883K(2530368K), 78.2503259 secs]
> 34704206K->32471817K(157005312K), 78.2506265 secs] [Times: user=2565.48
> sys=2.21, real=78.25 secs]
>
> 2024-11-19T05:35:55.417-0500: 40148.908: [GC (Allocation Failure)
> 2024-11-19T05:35:55.418-0500: 40148.908: [ParNew:
> 2324163K->102130K(2530368K), 96.4663544 secs]
> 34721097K->32500405K(157005312K), 96.4668756 secs] [Times: user=3161.99
> sys=2.26, real=96.46 secs]
>
> 2024-11-19T05:37:32.503-0500: 40245.993: [GC (Allocation Failure)
> 2024-11-19T05:37:32.503-0500: 40245.993: [ParNew:
> 2351410K->108861K(2530368K), 122.1160632 secs]
> 34749685K->32508388K(157005312K), 122.1164558 secs] [Times: user=4001.43
> sys=3.68, real=122.12 secs]
>
> 2024-11-19T05:39:35.187-0500: 40368.678: [GC (Allocation Failure)
> 2024-11-19T05:39:35.188-0500: 40368.678: [ParNew:
> 2358141K->144672K(2530368K), 161.7914697 secs]
> 34757668K->32546599K(157005312K), 161.7916831 secs] [Times: user=5302.53
> sys=4.43, real=161.79 secs]
>
> 2024-11-19T05:42:17.358-0500: 40530.849: [GC (Allocation Failure)
> 2024-11-19T05:42:17.359-0500: 40530.849: [ParNew:
> 2393952K->200667K(2530368K), 279.7220815 secs]
> 34795879K->32612580K(157005312K), 279.7224630 secs] [Times: user=9164.75
> sys=8.37, real=279.72 secs]
>
> 2024-11-19T05:46:57.492-0500: 40810.983: [GC (Allocation Failure)
> 2024-11-19T05:46:57.492-0500: 40810.983: [ParNew:
> 2449947K->250145K(2530368K), 0.2150744 secs]
> 34861860K->32690960K(157005312K), 0.2152389 secs] [Times: user=5.89
> sys=0.70, real=0.22 secs]
>
> The GC log records above do not indicate that an excessive amount of
> GC-related work is being done. Therefore, I’m guessing that most of the
> user CPU time is not spent doing productive work. Instead, I’m guessing
> that the GC threads were just spinning.
>
> The CPU usage on the host was 68-69% during this period. That makes
> sense based on the GC log records. For example, the second to last GC
> log record reports 9164 CPU seconds and 279 real seconds, which yields a
> ratio of 32.85. Divide 32.85 by the 48 CPUs and you get a usage of
> 68.4% … which is the CPU usage that we observed on the host at this
> time. That being the case, it appears that all of the observed CPU
> usage was consumed by the GC threads.
>
> Any idea what the GC threads might be doing that’s consuming so much CPU
> usage? Or suggestions for how I might determine what the GC threads are
> doing?
>
> Thanks,
>
> Shane Cox
>
> First few lines of GC log containing configuration information:
>
> OpenJDK 64-Bit Server VM (25.212-b04) for linux-amd64 JRE
> (1.8.0_212-b04), built on Apr 19 2019 23:23:17 by "root" with gcc 7.3.1
> 20180303 (Red Hat 7.3.1-5)
>
> Memory: 4k page, physical 196374708k(167484808k free), swap
> 4194300k(3010032k free)
>
> CommandLine flags: -XX:CMSInitiatingOccupancyFraction=75
> -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError
> -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600
> -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6
> -XX:NewSize=2878971904 -XX:OldPLABSize=16 -XX:OldSize=5757943808
> -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
>
> uname output:
>
> Linux 4.18.0-305.45.1.el8_4.x86_64 #1 SMP Wed Apr 6 13:48:37 EDT 2022
> x86_64 x86_64 x86_64 GNU/Linux
>
> lscpu output:
>
> Architecture: x86_64
>
> CPU op-mode(s): 32-bit, 64-bit
>
> Byte Order: Little Endian
>
> CPU(s): 48
>
> On-line CPU(s) list: 0-47
>
> Thread(s) per core: 2
>
> Core(s) per socket: 12
>
> Socket(s): 2
>
> NUMA node(s): 2
>
> Vendor ID: GenuineIntel
>
> CPU family: 6
>
> Model: 85
>
> Model name: Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz
>
> Stepping: 7
>
> CPU MHz: 4100.050
>
> CPU max MHz: 4200.0000
>
> CPU min MHz: 1200.0000
>
> BogoMIPS: 6600.00
>
> Virtualization: VT-x
>
> L1d cache: 32K
>
> L1i cache: 32K
>
> L2 cache: 1024K
>
> L3 cache: 25344K
>
> NUMA node0 CPU(s):
> 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
>
> NUMA node1 CPU(s):
> 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
>
>


_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.org
https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use




