From Shane.Cox at ice.com Tue Nov 19 16:45:31 2024
From: Shane.Cox at ice.com (Shane Cox)
Date: Tue, 19 Nov 2024 16:45:31 +0000
Subject: Minor GCs consuming an extreme amount of user CPU time
Message-ID: 

We are observing a transient behavior where Minor GCs consume an extreme amount of user CPU time. Below is an excerpt from a GC log illustrating this behavior. The first line is representative of typical behavior. Starting with the second line, we observe progressively longer Minor GC pause times and attendant user CPU usage. The last line shows GC behavior returning to normal.

2024-11-19T05:31:57.096-0500: 39910.586: [GC (Allocation Failure) 2024-11-19T05:31:57.096-0500: 39910.587: [ParNew: 2276397K->23593K(2530368K), 0.8039922 secs] 34156897K->32412975K(157005312K), 0.8042477 secs] [Times: user=26.09 sys=0.28, real=0.81 secs]
2024-11-19T05:32:01.942-0500: 39915.432: [GC (Allocation Failure) 2024-11-19T05:32:01.942-0500: 39915.433: [ParNew: 2272873K->22588K(2530368K), 5.9797785 secs] 34662255K->32413062K(157005312K), 5.9800455 secs] [Times: user=195.82 sys=0.16, real=5.98 secs]
2024-11-19T05:32:08.811-0500: 39922.302: [GC (Allocation Failure) 2024-11-19T05:32:08.812-0500: 39922.302: [ParNew: 2271868K->24057K(2530368K), 30.6785198 secs] 34662342K->32417812K(157005312K), 30.6788152 secs] [Times: user=1005.22 sys=0.69, real=30.68 secs]
2024-11-19T05:32:40.245-0500: 39953.736: [GC (Allocation Failure) 2024-11-19T05:32:40.245-0500: 39953.736: [ParNew: 2273337K->41786K(2530368K), 48.4038572 secs] 34667092K->32436550K(157005312K), 48.4041764 secs] [Times: user=1586.51 sys=0.85, real=48.41 secs]
2024-11-19T05:33:32.236-0500: 40005.726: [GC (Allocation Failure) 2024-11-19T05:33:32.236-0500: 40005.726: [ParNew: 2291066K->59232K(2530368K), 63.8453443 secs] 34685830K->32454926K(157005312K), 63.8456820 secs] [Times: user=2091.55 sys=1.66, real=63.85 secs]
2024-11-19T05:34:36.698-0500: 40070.189: [GC (Allocation Failure) 2024-11-19T05:34:36.698-0500: 40070.189: [ParNew: 2308512K->74883K(2530368K), 78.2503259 secs] 34704206K->32471817K(157005312K), 78.2506265 secs] [Times: user=2565.48 sys=2.21, real=78.25 secs]
2024-11-19T05:35:55.417-0500: 40148.908: [GC (Allocation Failure) 2024-11-19T05:35:55.418-0500: 40148.908: [ParNew: 2324163K->102130K(2530368K), 96.4663544 secs] 34721097K->32500405K(157005312K), 96.4668756 secs] [Times: user=3161.99 sys=2.26, real=96.46 secs]
2024-11-19T05:37:32.503-0500: 40245.993: [GC (Allocation Failure) 2024-11-19T05:37:32.503-0500: 40245.993: [ParNew: 2351410K->108861K(2530368K), 122.1160632 secs] 34749685K->32508388K(157005312K), 122.1164558 secs] [Times: user=4001.43 sys=3.68, real=122.12 secs]
2024-11-19T05:39:35.187-0500: 40368.678: [GC (Allocation Failure) 2024-11-19T05:39:35.188-0500: 40368.678: [ParNew: 2358141K->144672K(2530368K), 161.7914697 secs] 34757668K->32546599K(157005312K), 161.7916831 secs] [Times: user=5302.53 sys=4.43, real=161.79 secs]
2024-11-19T05:42:17.358-0500: 40530.849: [GC (Allocation Failure) 2024-11-19T05:42:17.359-0500: 40530.849: [ParNew: 2393952K->200667K(2530368K), 279.7220815 secs] 34795879K->32612580K(157005312K), 279.7224630 secs] [Times: user=9164.75 sys=8.37, real=279.72 secs]
2024-11-19T05:46:57.492-0500: 40810.983: [GC (Allocation Failure) 2024-11-19T05:46:57.492-0500: 40810.983: [ParNew: 2449947K->250145K(2530368K), 0.2150744 secs] 34861860K->32690960K(157005312K), 0.2152389 secs] [Times: user=5.89 sys=0.70, real=0.22 secs]

The GC log records above do not indicate that an excessive amount of GC-related work is being done.
Therefore, I'm guessing that most of the user CPU time is not spent doing productive work. Instead, I'm guessing that the GC threads were just spinning.

The CPU usage on the host was 68-69% during this period. That makes sense based on the GC log records. For example, the second to last GC log record reports 9164 CPU seconds and 279 real seconds, which yields a ratio of 32.85. Divide 32.85 by the 48 CPUs and you get a usage of 68.4% ... which is the CPU usage that we observed on the host at this time. That being the case, it appears that all of the observed CPU usage was consumed by the GC threads.

Any idea what the GC threads might be doing that's consuming so much CPU usage? Or suggestions for how I might determine what the GC threads are doing?

Thanks,
Shane Cox

First few lines of GC log containing configuration information:

OpenJDK 64-Bit Server VM (25.212-b04) for linux-amd64 JRE (1.8.0_212-b04), built on Apr 19 2019 23:23:17 by "root" with gcc 7.3.1 20180303 (Red Hat 7.3.1-5)
Memory: 4k page, physical 196374708k(167484808k free), swap 4194300k(3010032k free)
CommandLine flags: -XX:CMSInitiatingOccupancyFraction=75 -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600 -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6 -XX:NewSize=2878971904 -XX:OldPLABSize=16 -XX:OldSize=5757943808 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC

uname output:

Linux 4.18.0-305.45.1.el8_4.x86_64 #1 SMP Wed Apr 6 13:48:37 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

lscpu output:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz
Stepping:            7
CPU MHz:             4100.050
CPU max MHz:         4200.0000
CPU min MHz:         1200.0000
BogoMIPS:            6600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

________________________________

This message may contain confidential information and is intended for specific recipients unless explicitly noted otherwise. If you have reason to believe you are not an intended recipient of this message, please delete it and notify the sender. This message may not represent the opinion of Intercontinental Exchange, Inc. (ICE), its subsidiaries or affiliates, and does not constitute a contract or guarantee. Unencrypted electronic mail is not secure and the recipient of this message is expected to provide safeguards from viruses and pursue alternate means of communication where privacy or a binding message is desired.

From 4itmed at gmail.com Thu Nov 21 01:20:09 2024
From: 4itmed at gmail.com (Mohammed)
Date: Thu, 21 Nov 2024 02:20:09 +0100
Subject: Very long full G1 GC pauses
Message-ID: 

Dear GC Community,

I hope this message finds you well... I'm currently experiencing performance issues related to very long garbage collection (GC) pauses in our application, which uses the G1 on Java 8.
https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyNC8xMS8xOC9qYXZhX2djIC0gMTh0aCBOb3YubG9nLS0xMi01Mi0xMA==&channel=WEB

We increased the heap size to 14 GB but it didn't help:

https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyNC8xMS8yMC9qYXZhX2djLmxvZy0tMTctNDEtMjk=&channel=WEB

I highly appreciate any insights or recommendations to fine tune it, please.

Thank you in advance for your time and support.

OpenJDK 64-Bit Server VM (25.422-b05) for linux-amd64 JRE (1.8.0_422-b05), built on Jul 10 2024 20:58:03 by "mockbuild" with gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
Memory: 4k page, physical 22439452k(19045348k free), swap 4194300k(4194300k free)
CommandLine flags: -XX:InitialHeapSize=15032385536 -XX:InitialTenuringThreshold=4 -XX:InitiatingHeapOccupancyPercent=30 -XX:+ManagementServer -XX:MaxHeapSize=15032385536 -XX:MaxTenuringThreshold=4 -XX:+ParallelRefProcEnabled -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC -XX:+PrintReferenceGC -XX:+PrintTenuringDistribution -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC

2024-11-20T10:01:23.823+0100: 1.757: [GC pause (Metadata GC Threshold) (young) (initial-mark)
2024-11-20T10:01:23.839+0100: 1.773: [SoftReference, 0 refs, 0.0010719 secs]
2024-11-20T10:01:23.840+0100: 1.774: [WeakReference, 492 refs, 0.0005143 secs]
2024-11-20T10:01:23.841+0100: 1.775: [FinalReference, 1914 refs, 0.0025155 secs]
2024-11-20T10:01:23.843+0100: 1.777: [PhantomReference, 0 refs, 0.0006570 secs]
2024-11-20T10:01:23.844+0100: 1.778: [JNI Weak Reference, 0.0000576 secs], 0.0223584 secs]
2024-11-20T10:01:23.846+0100: 1.779: [GC concurrent-root-region-scan-start]
2024-11-20T10:01:23.850+0100: 1.783: [GC concurrent-root-region-scan-end, 0.0039879 secs]
2024-11-20T10:01:23.850+0100: 1.783: [GC concurrent-mark-start]
2024-11-20T10:01:23.855+0100: 1.789: [GC concurrent-mark-end, 0.0053844 secs]
2024-11-20T10:01:29.883+0100: 7.817: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:31.675+0100: 9.609: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:33.353+0100: 11.286: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:35.989+0100: 13.923: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:37.481+0100: 15.414: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:40.874+0100: 18.807: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:43.229+0100: 21.163: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T10:01:46.703+0100: 24.637: [GC pause (Metadata GC Threshold) (young) (initial-mark)
2024-11-20T10:01:55.044+0100: 32.977: [GC pause (G1 Evacuation Pause) (young)

2024-11-20T15:00:04.646+0100: 17922.579: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
2024-11-20T15:00:04.689+0100: 17922.622: [GC concurrent-root-region-scan-start]
2024-11-20T15:00:04.725+0100: 17922.659: [GC concurrent-root-region-scan-end, 0.0365063 secs]
2024-11-20T15:00:04.725+0100: 17922.659: [GC concurrent-mark-start]
2024-11-20T15:00:06.107+0100: 17924.040: [GC pause (G1 Evacuation Pause) (young)
2024-11-20T15:00:12.337+0100: 17930.270: [GC cleanup 10530M->10490M(14336M), 0.0788429 secs]
2024-11-20T15:00:12.416+0100: 17930.349: [GC concurrent-cleanup-start]
2024-11-20T15:00:15.700+0100: 17933.633: [GC pause (G1 Evacuation Pause) (mixed)
2024-11-20T15:00:19.262+0100: 17937.196: [GC pause (G1 Evacuation Pause) (mixed)
2024-11-20T15:00:21.759+0100: 17939.692: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
2024-11-20T15:00:29.523+0100: 17947.457: [GC concurrent-mark-end, 7.6961570 secs]
2024-11-20T15:00:29.688+0100: 17947.622: [GC cleanup 10369M->10329M(14336M), 0.0734577 secs]
2024-11-20T15:00:47.950+0100: 17965.884: [GC pause (G1 Evacuation Pause) (young) (initial-mark)
2024-11-20T15:00:55.434+0100: 17973.368: [GC concurrent-mark-end, 7.3457392 secs]

2024-11-20T18:02:39.789+0100: [Full GC (Allocation Failure) 13373M->10294M(14336M), 15.5035776 secs]
2024-11-20T18:02:39.789+0100: [SoftReference, 13012 refs, 0.0020701 secs]
2024-11-20T18:02:39.789+0100: [WeakReference, 140239 refs, 0.0156066 secs]
2024-11-20T18:02:39.789+0100: [FinalReference, 3610 refs, 0.0003202 secs]
2024-11-20T18:02:39.789+0100: [PhantomReference, 194 refs, 0.0000238 secs]
2024-11-20T18:02:39.789+0100: [JNI Weak Reference, 0.0003217 secs]

2024-11-20T18:02:53.371+0100: [Full GC (Allocation Failure) 13472M->10131M(14336M), 16.4023915 secs]
2024-11-20T18:02:53.371+0100: [SoftReference, 13771 refs, 0.0022936 secs]
2024-11-20T18:02:53.371+0100: [WeakReference, 138881 refs, 0.0158819 secs]
2024-11-20T18:02:53.371+0100: [FinalReference, 3644 refs, 0.0003447 secs]
2024-11-20T18:02:53.371+0100: [PhantomReference, 194 refs, 0.0000328 secs]
2024-11-20T18:02:53.371+0100: [JNI Weak Reference, 0.0003272 secs]

2024-11-20T18:03:12.492+0100: [Full GC (Allocation Failure) 12600M->10282M(14336M), 14.9648411 secs]
2024-11-20T18:03:12.492+0100: [SoftReference, 13027 refs, 0.0022496 secs]
2024-11-20T18:03:12.492+0100: [WeakReference, 135613 refs, 0.0150724 secs]
2024-11-20T18:03:12.492+0100: [FinalReference, 3643 refs, 0.0003302 secs]
2024-11-20T18:03:12.492+0100: [PhantomReference, 194 refs, 0.0000216 secs]
2024-11-20T18:03:12.492+0100: [JNI Weak Reference, 0.0003229 secs]

2024-11-20T18:03:31.107+0100: [Full GC (Allocation Failure) 13365M->10281M(14336M), 15.4983270 secs]
2024-11-20T18:03:31.107+0100: [SoftReference, 13110 refs, 0.0021606 secs]
2024-11-20T18:03:31.107+0100: [WeakReference, 135613 refs, 0.0149449 secs]
2024-11-20T18:03:31.107+0100: [FinalReference, 3641 refs, 0.0003340 secs]
2024-11-20T18:03:31.107+0100: [PhantomReference, 194 refs, 0.0000244 secs]
2024-11-20T18:03:31.107+0100: [JNI Weak Reference, 0.0003422 secs]

2024-11-20T18:03:50.243+0100: [Full GC (Allocation Failure) 13360M->10280M(14336M), 15.1688211 secs]
2024-11-20T18:03:50.243+0100: [SoftReference, 13110 refs, 0.0020750 secs]
2024-11-20T18:03:50.243+0100: [WeakReference, 135612 refs, 0.0149417 secs]
2024-11-20T18:03:50.243+0100: [FinalReference, 3641 refs, 0.0003294 secs]
2024-11-20T18:03:50.243+0100: [PhantomReference, 194 refs, 0.0000235 secs]
2024-11-20T18:03:50.243+0100: [JNI Weak Reference, 0.0003195 secs]

Best regards

From stefan.johansson at oracle.com Thu Nov 21 07:35:29 2024
From: stefan.johansson at oracle.com (Stefan Johansson)
Date: Thu, 21 Nov 2024 08:35:29 +0100
Subject: Very long full G1 GC pauses
In-Reply-To: 
References: 
Message-ID: <9bdd9289-e0a3-4be7-be1c-0a47d38d0881@oracle.com>

Hi,

A quick look at the GCeasy pages shows that the old-generation occupancy starts to increase right before the Full GCs. This is likely caused by an increase in allocation rate (likely caused by higher load on the application or a change in behavior), which leads to more frequent GCs and objects getting promoted to the old generation quicker. Without the complete logs, it's hard to say exactly what tuning could or should be done. We already seem to run the concurrent marking quite frequently, but lowering the InitiatingHeapOccupancyPercent could perhaps help avoid the Full GCs.
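As a rough illustration only (the exact value is a guess and would need to be validated against the complete logs), lowering the IHOP from the 30 currently set in the flags above might look like:

-XX:InitiatingHeapOccupancyPercent=20

This makes G1 start its concurrent marking cycle at a lower old-generation occupancy, trading more frequent concurrent cycles for more headroom before the heap fills up.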
You also seem to have some GCs caused by humongous allocations around the Full GCs, depending on how big those objects are increasing the G1 regions size could possibly help. Looking at your heap size the default region size should be 4m, so increasing it to 8m could help: -XX:G1HeapRegionSize=8m. If you don't have the complete logs easy available, you can collect them into a separate file using -Xloggc:. Hope this helps, Stefan On 2024-11-21 02:20, Mohammed wrote: > Dear GC Community, > > I hope this message finds you well... I'm currently experiencing > performance issues related to very long garbage collection (GC) pauses > in our application, which uses the G1 on Java 8. > > https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyNC8xMS8xOC9qYXZhX2djIC0gMTh0aCBOb3YubG9nLS0xMi01Mi0xMA==&channel=WEB > > > We increased the heap size to 14 GB but it didn't help: > > https://gceasy.io/my-gc-report.jsp?p=YXJjaGl2ZWQvMjAyNC8xMS8yMC9qYXZhX2djLmxvZy0tMTctNDEtMjk=&channel=WEB > > > I highly appreciate any insights or recommendations to fine tune it, > please. > > Thank you in advance for your time and support. > > > OpenJDK 64-Bit Server VM (25.422-b05) for linux-amd64 JRE > (1.8.0_422-b05), built on Jul 10 2024 20:58:03 by "mockbuild" with gcc > 4.8.5 20150623 (Red Hat 4.8.5-44) > Memory: 4k page, physical 22439452k(19045348k free), swap > 4194300k(4194300k free) > CommandLine flags: -XX:InitialHeapSize=15032385536 > -XX:InitialTenuringThreshold=4 -XX:InitiatingHeapOccupancyPercent=30 > -XX:+ManagementServer -XX:MaxHeapSize=15032385536 > -XX:MaxTenuringThreshold=4 -XX:+ParallelRefProcEnabled -XX:+PrintGC > -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps > -XX:+PrintHeapAtGC -XX:+PrintReferenceGC > -XX:+PrintTenuringDistribution -XX:+UseCompressedClassPointers > -XX:+UseCompressedOops -XX:+UseG1GC > > > 2024-11-20T10:01:23.823+0100: 1.757: [GC pause (Metadata GC Threshold) > (young) (initial-mark) > 2024-11-20T10:01:23.839+0100: 1.773: [SoftReference, 0 refs, 0.0010719 > secs] > 2024-11-20T10:01:23.840+0100: 1.774: [WeakReference, 492 refs, > 0.0005143 secs] > 2024-11-20T10:01:23.841+0100: 1.775: [FinalReference, 1914 refs, > 0.0025155 secs] > 2024-11-20T10:01:23.843+0100: 1.777: [PhantomReference, 0 refs, > 0.0006570 secs] > 2024-11-20T10:01:23.844+0100: 1.778: [JNI Weak Reference, 0.0000576 > secs], 0.0223584 secs] > 2024-11-20T10:01:23.846+0100: 1.779: [GC > concurrent-root-region-scan-start] > 2024-11-20T10:01:23.850+0100: 1.783: [GC > concurrent-root-region-scan-end, 0.0039879 secs] > 2024-11-20T10:01:23.850+0100: 1.783: [GC concurrent-mark-start] > 2024-11-20T10:01:23.855+0100: 1.789: [GC concurrent-mark-end, > 0.0053844 secs] > 2024-11-20T10:01:29.883+0100: 7.817: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:31.675+0100: 9.609: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:33.353+0100: 11.286: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:35.989+0100: 13.923: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:37.481+0100: 15.414: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:40.874+0100: 18.807: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:43.229+0100: 21.163: [GC pause (G1 Evacuation Pause) > (young) > 2024-11-20T10:01:46.703+0100: 24.637: [GC pause (Metadata GC > Threshold) (young) (initial-mark) > 2024-11-20T10:01:55.044+0100: 32.977: [GC pause (G1 Evacuation Pause) > (young) > > 2024-11-20T15:00:04.646+0100: 17922.579: [GC pause (G1 Evacuation > Pause) (young) (initial-mark) > 
2024-11-20T15:00:04.689+0100: 17922.622: [GC > concurrent-root-region-scan-start] > 2024-11-20T15:00:04.725+0100: 17922.659: [GC > concurrent-root-region-scan-end, 0.0365063 secs] > 2024-11-20T15:00:04.725+0100: 17922.659: [GC concurrent-mark-start] > 2024-11-20T15:00:06.107+0100: 17924.040: [GC pause (G1 Evacuation > Pause) (young) > 2024-11-20T15:00:12.337+0100: 17930.270: [GC cleanup > 10530M->10490M(14336M), 0.0788429 secs] > 2024-11-20T15:00:12.416+0100: 17930.349: [GC concurrent-cleanup-start] > 2024-11-20T15:00:15.700+0100: 17933.633: [GC pause (G1 Evacuation > Pause) (mixed) > 2024-11-20T15:00:19.262+0100: 17937.196: [GC pause (G1 Evacuation > Pause) (mixed) > 2024-11-20T15:00:21.759+0100: 17939.692: [GC pause (G1 Evacuation > Pause) (young) (initial-mark) > 2024-11-20T15:00:29.523+0100: 17947.457: [GC concurrent-mark-end, > 7.6961570 secs] > 2024-11-20T15:00:29.688+0100: 17947.622: [GC cleanup > 10369M->10329M(14336M), 0.0734577 secs] > 2024-11-20T15:00:47.950+0100: 17965.884: [GC pause (G1 Evacuation > Pause) (young) (initial-mark) > 2024-11-20T15:00:55.434+0100: 17973.368: [GC concurrent-mark-end, > 7.3457392 secs] > > 2024-11-20T18:02:39.789+0100: [Full GC (Allocation Failure) > 13373M->10294M(14336M), 15.5035776 secs] > 2024-11-20T18:02:39.789+0100: [SoftReference, 13012 refs, 0.0020701 secs] > 2024-11-20T18:02:39.789+0100: [WeakReference, 140239 refs, 0.0156066 secs] > 2024-11-20T18:02:39.789+0100: [FinalReference, 3610 refs, 0.0003202 secs] > 2024-11-20T18:02:39.789+0100: [PhantomReference, 194 refs, 0.0000238 secs] > 2024-11-20T18:02:39.789+0100: [JNI Weak Reference, 0.0003217 secs] > > 2024-11-20T18:02:53.371+0100: [Full GC (Allocation Failure) > 13472M->10131M(14336M), 16.4023915 secs] > 2024-11-20T18:02:53.371+0100: [SoftReference, 13771 refs, 0.0022936 secs] > 2024-11-20T18:02:53.371+0100: [WeakReference, 138881 refs, 0.0158819 secs] > 2024-11-20T18:02:53.371+0100: [FinalReference, 3644 refs, 0.0003447 secs] > 2024-11-20T18:02:53.371+0100: [PhantomReference, 194 refs, 0.0000328 secs] > 2024-11-20T18:02:53.371+0100: [JNI Weak Reference, 0.0003272 secs] > > 2024-11-20T18:03:12.492+0100: [Full GC (Allocation Failure) > 12600M->10282M(14336M), 14.9648411 secs] > 2024-11-20T18:03:12.492+0100: [SoftReference, 13027 refs, 0.0022496 secs] > 2024-11-20T18:03:12.492+0100: [WeakReference, 135613 refs, 0.0150724 secs] > 2024-11-20T18:03:12.492+0100: [FinalReference, 3643 refs, 0.0003302 secs] > 2024-11-20T18:03:12.492+0100: [PhantomReference, 194 refs, 0.0000216 secs] > 2024-11-20T18:03:12.492+0100: [JNI Weak Reference, 0.0003229 secs] > > 2024-11-20T18:03:31.107+0100: [Full GC (Allocation Failure) > 13365M->10281M(14336M), 15.4983270 secs] > 2024-11-20T18:03:31.107+0100: [SoftReference, 13110 refs, 0.0021606 secs] > 2024-11-20T18:03:31.107+0100: [WeakReference, 135613 refs, 0.0149449 secs] > 2024-11-20T18:03:31.107+0100: [FinalReference, 3641 refs, 0.0003340 secs] > 2024-11-20T18:03:31.107+0100: [PhantomReference, 194 refs, 0.0000244 secs] > 2024-11-20T18:03:31.107+0100: [JNI Weak Reference, 0.0003422 secs] > > 2024-11-20T18:03:50.243+0100: [Full GC (Allocation Failure) > 13360M->10280M(14336M), 15.1688211 secs] > 2024-11-20T18:03:50.243+0100: [SoftReference, 13110 refs, 0.0020750 secs] > 2024-11-20T18:03:50.243+0100: [WeakReference, 135612 refs, 0.0149417 secs] > 2024-11-20T18:03:50.243+0100: [FinalReference, 3641 refs, 0.0003294 secs] > 2024-11-20T18:03:50.243+0100: [PhantomReference, 194 refs, 0.0000235 secs] > 2024-11-20T18:03:50.243+0100: [JNI Weak Reference, 0.0003195 secs] 
>
> Best regards
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.org
> https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use

From thomas.schatzl at oracle.com Fri Nov 22 08:50:50 2024
From: thomas.schatzl at oracle.com (Thomas Schatzl)
Date: Fri, 22 Nov 2024 09:50:50 +0100
Subject: Minor GCs consuming an extreme amount of user CPU time
In-Reply-To: 
References: 
Message-ID: 

Hi,

not having a lot of experience with CMS, the following could be potential issues here:

* this is a burstable VM and it sporadically does not get cpu time. Does not seem so for such a large setup for me.

* I am not sure about CMS (it's really been a long time) but I think we fixed some issue with large arrays of references not being well distributed across threads in Parallel GC (fairly recently), and CMS might have the same issue.

Not sure this would show up as all threads being busy for a long time.

AFAIR there is not much logging in ParNew/CMS either (-XX:+PrintGCDetails may help though).

Maybe you know whether this can happen in your application.

A heap histogram may also help during that time (but

* maybe parallel reference processing is not enabled. Use -XX:+PrintReferenceGC / -XX:+ParallelRefProcEnabled to enable.

Maybe there is still somebody else out there more knowledgeable about CMS tuning/issues than me.

Hth,
  Thomas

On 19.11.24 17:45, Shane Cox wrote:
> We are observing a transient behavior where Minor GCs consume an extreme
> amount of user CPU time. Below is an excerpt from a GC log illustrating
> this behavior. The first line is representative of typical behavior.
> Starting with the second line, we observe progressively longer Minor GC
> pause times and attendant user CPU usage. The last line shows GC
> behavior returning to normal.
> > 2024-11-19T05:31:57.096-0500: 39910.586: [GC (Allocation Failure) > 2024-11-19T05:31:57.096-0500: 39910.587: [ParNew: > 2276397K->23593K(2530368K), 0.8039922 secs] > 34156897K->32412975K(157005312K), 0.8042477 secs] [Times: user=26.09 > sys=0.28, real=0.81 secs] > > 2024-11-19T05:32:01.942-0500: 39915.432: [GC (Allocation Failure) > 2024-11-19T05:32:01.942-0500: 39915.433: [ParNew: > 2272873K->22588K(2530368K), 5.9797785 secs] > 34662255K->32413062K(157005312K), 5.9800455 secs] [Times: user=195.82 > sys=0.16, real=5.98 secs] > > 2024-11-19T05:32:08.811-0500: 39922.302: [GC (Allocation Failure) > 2024-11-19T05:32:08.812-0500: 39922.302: [ParNew: > 2271868K->24057K(2530368K), 30.6785198 secs] > 34662342K->32417812K(157005312K), 30.6788152 secs] [Times: user=1005.22 > sys=0.69, real=30.68 secs] > > 2024-11-19T05:32:40.245-0500: 39953.736: [GC (Allocation Failure) > 2024-11-19T05:32:40.245-0500: 39953.736: [ParNew: > 2273337K->41786K(2530368K), 48.4038572 secs] > 34667092K->32436550K(157005312K), 48.4041764 secs] [Times: user=1586.51 > sys=0.85, real=48.41 secs] > > 2024-11-19T05:33:32.236-0500: 40005.726: [GC (Allocation Failure) > 2024-11-19T05:33:32.236-0500: 40005.726: [ParNew: > 2291066K->59232K(2530368K), 63.8453443 secs] > 34685830K->32454926K(157005312K), 63.8456820 secs] [Times: user=2091.55 > sys=1.66, real=63.85 secs] > > 2024-11-19T05:34:36.698-0500: 40070.189: [GC (Allocation Failure) > 2024-11-19T05:34:36.698-0500: 40070.189: [ParNew: > 2308512K->74883K(2530368K), 78.2503259 secs] > 34704206K->32471817K(157005312K), 78.2506265 secs] [Times: user=2565.48 > sys=2.21, real=78.25 secs] > > 2024-11-19T05:35:55.417-0500: 40148.908: [GC (Allocation Failure) > 2024-11-19T05:35:55.418-0500: 40148.908: [ParNew: > 2324163K->102130K(2530368K), 96.4663544 secs] > 34721097K->32500405K(157005312K), 96.4668756 secs] [Times: user=3161.99 > sys=2.26, real=96.46 secs] > > 2024-11-19T05:37:32.503-0500: 40245.993: [GC (Allocation Failure) > 2024-11-19T05:37:32.503-0500: 40245.993: [ParNew: > 2351410K->108861K(2530368K), 122.1160632 secs] > 34749685K->32508388K(157005312K), 122.1164558 secs] [Times: user=4001.43 > sys=3.68, real=122.12 secs] > > 2024-11-19T05:39:35.187-0500: 40368.678: [GC (Allocation Failure) > 2024-11-19T05:39:35.188-0500: 40368.678: [ParNew: > 2358141K->144672K(2530368K), 161.7914697 secs] > 34757668K->32546599K(157005312K), 161.7916831 secs] [Times: user=5302.53 > sys=4.43, real=161.79 secs] > > 2024-11-19T05:42:17.358-0500: 40530.849: [GC (Allocation Failure) > 2024-11-19T05:42:17.359-0500: 40530.849: [ParNew: > 2393952K->200667K(2530368K), 279.7220815 secs] > 34795879K->32612580K(157005312K), 279.7224630 secs] [Times: user=9164.75 > sys=8.37, real=279.72 secs] > > 2024-11-19T05:46:57.492-0500: 40810.983: [GC (Allocation Failure) > 2024-11-19T05:46:57.492-0500: 40810.983: [ParNew: > 2449947K->250145K(2530368K), 0.2150744 secs] > 34861860K->32690960K(157005312K), 0.2152389 secs] [Times: user=5.89 > sys=0.70, real=0.22 secs] > > The GC log records above do not indicate that an excessive amount of > GC-related work is being done.? Therefore, I?m guessing that most of the > user CPU time is not spent doing productive work.? Instead, I?m guessing > that the GC threads were just spinning. > > The CPU usage on the host was 68-69% during this period.? That makes > sense based on the GC log records.? For example, the second to last GC > log record reports 9164 CPU seconds and 279 real seconds, which yields a > ratio of 32.85.? 
Divide 32.85 by the 48 CPUs and you get a usage of > 68.4% ? which is the CPU usage that we observed on the host at this > time.? That being the case, it appears that all of the observed CPU > usage was consumed by the GC threads. > > Any idea what the GC threads might be doing that?s consuming so much CPU > usage?? Or suggestions for how I might determine what the GC threads are > doing? > > Thanks, > > Shane Cox > > First few lines of GC log containing configuration information: > > OpenJDK 64-Bit Server VM (25.212-b04) for linux-amd64 JRE > (1.8.0_212-b04), built on Apr 19 2019 23:23:17 by "root" with gcc 7.3.1 > 20180303 (Red Hat 7.3.1-5) > > Memory: 4k page, physical 196374708k(167484808k free), swap > 4194300k(3010032k free) > > CommandLine flags: -XX:CMSInitiatingOccupancyFraction=75 > -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError > -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600 > -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6 > -XX:NewSize=2878971904 -XX:OldPLABSize=16 -XX:OldSize=5757943808 > -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > > uname output: > > Linux 4.18.0-305.45.1.el8_4.x86_64 #1 SMP Wed Apr 6 13:48:37 EDT 2022 > x86_64 x86_64 x86_64 GNU/Linux > > lscpu output: > > Architecture:??????? x86_64 > > CPU op-mode(s):????? 32-bit, 64-bit > > Byte Order:????????? Little Endian > > CPU(s):????????????? 48 > > On-line CPU(s) list: 0-47 > > Thread(s) per core:? 2 > > Core(s) per socket:? 12 > > Socket(s):?????????? 2 > > NUMA node(s):??????? 2 > > Vendor ID:?????????? GenuineIntel > > CPU family:????????? 6 > > Model:?????????????? 85 > > Model name:????????? Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz > > Stepping:??????????? 7 > > CPU MHz:???????????? 4100.050 > > CPU max MHz:???????? 4200.0000 > > CPU min MHz:???????? 1200.0000 > > BogoMIPS:??????????? 6600.00 > > Virtualization:????? VT-x > > L1d cache:?????????? 32K > > L1i cache:?????????? 32K > > L2 cache:??????????? 1024K > > L3 cache:??????????? 25344K > > NUMA node0 CPU(s): > 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 > > NUMA node1 CPU(s): > 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 > > > ------------------------------------------------------------------------ > > This message may contain confidential information and is intended for > specific recipients unless explicitly noted otherwise. If you have > reason to believe you are not an intended recipient of this message, > please delete it and notify the sender. This message may not represent > the opinion of Intercontinental Exchange, Inc. (ICE), its subsidiaries > or affiliates, and does not constitute a contract or guarantee. > Unencrypted electronic mail is not secure and the recipient of this > message is expected to provide safeguards from viruses and pursue > alternate means of communication where privacy or a binding message is > desired. > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.org > https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use From hohensee at amazon.com Fri Nov 22 15:48:57 2024 From: hohensee at amazon.com (Hohensee, Paul) Date: Fri, 22 Nov 2024 15:48:57 +0000 Subject: Minor GCs consuming an extreme amount of user CPU time Message-ID: <16EE19E2-058F-45DA-BAB8-5BC838204E9B@amazon.com> -XX:+PrintGCDetails would help. 
And I'd upgrade to JDK 11 if you want to continue using CMS, but better would be to upgrade to JDK 17 or 21 and use G1 instead.

Given the very long pause times, I suspect, in likelihood order, inter-socket memory accesses, actual swapping, or perhaps an OS pause linked to log writes. Viz:

> Memory: 4k page, physical 196374708k(167484808k free), swap
> 4194300k(3010032k free)
> -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600
> -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6

CMS wasn't designed for more than ~8gb heaps (even with 24 physical cores), and this one is 153gb with a relatively small 2.7gb young gen, so heap config may be a factor.

Free memory at JVM launch looks to be 167484808k and -Xmx is 157286400k, difference 10198408k. The JVM uses native memory for various things (e.g., the card table, which would be 307200k), but that alone probably wouldn't use up all of the left over 10gb.

Your heap spans two sockets, and memory access latency between sockets is multiples of intra-socket latency (I don't know off-hand the difference in this case, but it used to be a factor of 5 or more). The OS can dynamically move physical pages between sockets based on memory access patterns. In this case you might be doing GCs that do mostly cross-socket accesses until the OS figures out what's happening and moves the pages. On multiple socket systems, it's usually better to run one JVM per socket using numactl, which would guarantee intra-socket memory accesses.

There may be other processes running that use up memory and provoke swapping.

JVM logging is synchronous in JDK 8, which might indirectly cause long pauses due to the OS having to flush its I/O cache to flash or disk.

Regards,
Paul

On 11/22/24, 12:51 AM, "hotspot-gc-use on behalf of Thomas Schatzl" on behalf of thomas.schatzl at oracle.com > wrote:

Hi,

not having a lot of experience with CMS, the following could be potential issues here:

* this is a burstable VM and it sporadically does not get cpu time. Does not seem so for such a large setup for me.

* I am not sure about CMS (it's really been a long time) but I think we fixed some issue with large arrays of references not being well distributed across threads in Parallel GC (fairly recently), and CMS might have the same issue.

Not sure this would show up as all threads being busy for a long time.

Afair there is not much logging in ParNew/CMS either (-XX:+PrintGCDetails may help though).

Maybe you know whether this can happen in your application.

A heap histogram may also help during that time (but

* maybe parallel reference processing not enabled. Use -XX:+PrintReferenceGC / -XX:+ParallelRefProcEnabled to enable.

Maybe there is still somebody else out there more knowledgable about CMS tuning/issues than me.

Hth,
Thomas

On 19.11.24 17:45, Shane Cox wrote:
> We are observing a transient behavior where Minor GCs consume an extreme
> amount of user CPU time. Below is an excerpt from a GC log illustrating
> this behavior. The first line is representative of typical behavior.
> Starting with the second line, we observe progressively longer Minor GC
> pause times and attendant user CPU usage. The last line shows GC
> behavior returning to normal.
> > 2024-11-19T05:31:57.096-0500: 39910.586: [GC (Allocation Failure) > 2024-11-19T05:31:57.096-0500: 39910.587: [ParNew: > 2276397K->23593K(2530368K), 0.8039922 secs] > 34156897K->32412975K(157005312K), 0.8042477 secs] [Times: user=26.09 > sys=0.28, real=0.81 secs] > > 2024-11-19T05:32:01.942-0500: 39915.432: [GC (Allocation Failure) > 2024-11-19T05:32:01.942-0500: 39915.433: [ParNew: > 2272873K->22588K(2530368K), 5.9797785 secs] > 34662255K->32413062K(157005312K), 5.9800455 secs] [Times: user=195.82 > sys=0.16, real=5.98 secs] > > 2024-11-19T05:32:08.811-0500: 39922.302: [GC (Allocation Failure) > 2024-11-19T05:32:08.812-0500: 39922.302: [ParNew: > 2271868K->24057K(2530368K), 30.6785198 secs] > 34662342K->32417812K(157005312K), 30.6788152 secs] [Times: user=1005.22 > sys=0.69, real=30.68 secs] > > 2024-11-19T05:32:40.245-0500: 39953.736: [GC (Allocation Failure) > 2024-11-19T05:32:40.245-0500: 39953.736: [ParNew: > 2273337K->41786K(2530368K), 48.4038572 secs] > 34667092K->32436550K(157005312K), 48.4041764 secs] [Times: user=1586.51 > sys=0.85, real=48.41 secs] > > 2024-11-19T05:33:32.236-0500: 40005.726: [GC (Allocation Failure) > 2024-11-19T05:33:32.236-0500: 40005.726: [ParNew: > 2291066K->59232K(2530368K), 63.8453443 secs] > 34685830K->32454926K(157005312K), 63.8456820 secs] [Times: user=2091.55 > sys=1.66, real=63.85 secs] > > 2024-11-19T05:34:36.698-0500: 40070.189: [GC (Allocation Failure) > 2024-11-19T05:34:36.698-0500: 40070.189: [ParNew: > 2308512K->74883K(2530368K), 78.2503259 secs] > 34704206K->32471817K(157005312K), 78.2506265 secs] [Times: user=2565.48 > sys=2.21, real=78.25 secs] > > 2024-11-19T05:35:55.417-0500: 40148.908: [GC (Allocation Failure) > 2024-11-19T05:35:55.418-0500: 40148.908: [ParNew: > 2324163K->102130K(2530368K), 96.4663544 secs] > 34721097K->32500405K(157005312K), 96.4668756 secs] [Times: user=3161.99 > sys=2.26, real=96.46 secs] > > 2024-11-19T05:37:32.503-0500: 40245.993: [GC (Allocation Failure) > 2024-11-19T05:37:32.503-0500: 40245.993: [ParNew: > 2351410K->108861K(2530368K), 122.1160632 secs] > 34749685K->32508388K(157005312K), 122.1164558 secs] [Times: user=4001.43 > sys=3.68, real=122.12 secs] > > 2024-11-19T05:39:35.187-0500: 40368.678: [GC (Allocation Failure) > 2024-11-19T05:39:35.188-0500: 40368.678: [ParNew: > 2358141K->144672K(2530368K), 161.7914697 secs] > 34757668K->32546599K(157005312K), 161.7916831 secs] [Times: user=5302.53 > sys=4.43, real=161.79 secs] > > 2024-11-19T05:42:17.358-0500: 40530.849: [GC (Allocation Failure) > 2024-11-19T05:42:17.359-0500: 40530.849: [ParNew: > 2393952K->200667K(2530368K), 279.7220815 secs] > 34795879K->32612580K(157005312K), 279.7224630 secs] [Times: user=9164.75 > sys=8.37, real=279.72 secs] > > 2024-11-19T05:46:57.492-0500: 40810.983: [GC (Allocation Failure) > 2024-11-19T05:46:57.492-0500: 40810.983: [ParNew: > 2449947K->250145K(2530368K), 0.2150744 secs] > 34861860K->32690960K(157005312K), 0.2152389 secs] [Times: user=5.89 > sys=0.70, real=0.22 secs] > > The GC log records above do not indicate that an excessive amount of > GC-related work is being done. Therefore, I?m guessing that most of the > user CPU time is not spent doing productive work. Instead, I?m guessing > that the GC threads were just spinning. > > The CPU usage on the host was 68-69% during this period. That makes > sense based on the GC log records. For example, the second to last GC > log record reports 9164 CPU seconds and 279 real seconds, which yields a > ratio of 32.85. Divide 32.85 by the 48 CPUs and you get a usage of > 68.4% ? 
which is the CPU usage that we observed on the host at this > time. That being the case, it appears that all of the observed CPU > usage was consumed by the GC threads. > > Any idea what the GC threads might be doing that?s consuming so much CPU > usage? Or suggestions for how I might determine what the GC threads are > doing? > > Thanks, > > Shane Cox > > First few lines of GC log containing configuration information: > > OpenJDK 64-Bit Server VM (25.212-b04) for linux-amd64 JRE > (1.8.0_212-b04), built on Apr 19 2019 23:23:17 by "root" with gcc 7.3.1 > 20180303 (Red Hat 7.3.1-5) > > Memory: 4k page, physical 196374708k(167484808k free), swap > 4194300k(3010032k free) > > CommandLine flags: -XX:CMSInitiatingOccupancyFraction=75 > -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError > -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600 > -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6 > -XX:NewSize=2878971904 -XX:OldPLABSize=16 -XX:OldSize=5757943808 > -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails > -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > > uname output: > > Linux 4.18.0-305.45.1.el8_4.x86_64 #1 SMP Wed Apr 6 13:48:37 EDT 2022 > x86_64 x86_64 x86_64 GNU/Linux > > lscpu output: > > Architecture: x86_64 > > CPU op-mode(s): 32-bit, 64-bit > > Byte Order: Little Endian > > CPU(s): 48 > > On-line CPU(s) list: 0-47 > > Thread(s) per core: 2 > > Core(s) per socket: 12 > > Socket(s): 2 > > NUMA node(s): 2 > > Vendor ID: GenuineIntel > > CPU family: 6 > > Model: 85 > > Model name: Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz > > Stepping: 7 > > CPU MHz: 4100.050 > > CPU max MHz: 4200.0000 > > CPU min MHz: 1200.0000 > > BogoMIPS: 6600.00 > > Virtualization: VT-x > > L1d cache: 32K > > L1i cache: 32K > > L2 cache: 1024K > > L3 cache: 25344K > > NUMA node0 CPU(s): > 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 > > NUMA node1 CPU(s): > 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 > > > ------------------------------------------------------------------------ > > This message may contain confidential information and is intended for > specific recipients unless explicitly noted otherwise. If you have > reason to believe you are not an intended recipient of this message, > please delete it and notify the sender. This message may not represent > the opinion of Intercontinental Exchange, Inc. (ICE), its subsidiaries > or affiliates, and does not constitute a contract or guarantee. > Unencrypted electronic mail is not secure and the recipient of this > message is expected to provide safeguards from viruses and pursue > alternate means of communication where privacy or a binding message is > desired. > > _______________________________________________ > hotspot-gc-use mailing list > hotspot-gc-use at openjdk.org > https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use _______________________________________________ hotspot-gc-use mailing list hotspot-gc-use at openjdk.org https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use From charlie.hunt at oracle.com Fri Nov 22 17:42:17 2024 From: charlie.hunt at oracle.com (charlie hunt) Date: Fri, 22 Nov 2024 11:42:17 -0600 Subject: Minor GCs consuming an extreme amount of user CPU time In-Reply-To: References: Message-ID: <5e6d33c2-4bdb-43ae-99a0-ca33e6f5ed33@oracle.com> To add to comments from Thomas, and Paul Hohensee, the GC logs also tend to show a similar pattern we have seen with CMS at times with transparent huge pages being set to "always". 
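For reference, the transparent huge page configuration referred to here lives in sysfs and can be inspected with:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

The bracketed value in the output (e.g. [never]) is the currently active setting.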
The symptoms tended to be occasional high user CPU time along with some sys CPU time. So another thing that may be worth looking at is setting the OS's transparent huge pages enabled to "never", and also setting transparent huge pages defrag to "never". Alternatively you could also try adding -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch to set of JVM command line args instead of mucking with the OS's transparent huge pages configuration. Charlie On 11/22/24 2:50 AM, Thomas Schatzl wrote: > Hi, > > ? not having a lot of experience with CMS, the following could be > potential issues here: > > * this is a burstable VM and it sporadically does not get cpu time. > Does not seem so for such a large setup for me. > > * I am not sure about CMS (it's really been a long time) but I think > we fixed some issue with large arrays of references not being well > distributed across threads in Parallel GC (fairly recently), and CMS > might have the same issue. > > Not sure this would show up as all threads being busy for a long time. > > Afair there is not much logging in ParNew/CMS either > (-XX:+PrintGCDetails may help though). > > Maybe you know whether this can happen in your application. > > A heap histogram may also help during that time (but > > * maybe parallel reference processing not enabled. Use > -XX:+PrintReferenceGC / -XX:+ParallelRefProcEnabled to enable. > > Maybe there is still somebody else out there more knowledgable about > CMS tuning/issues than me. > > Hth, > ? Thomas > > On 19.11.24 17:45, Shane Cox wrote: >> We are observing a transient behavior where Minor GCs consume an >> extreme amount of user CPU time. Below is an excerpt from a GC log >> illustrating this behavior. The first line is representative of >> typical behavior.? Starting with the second line, we observe >> progressively longer Minor GC pause times and attendant user CPU >> usage.? The last line shows GC behavior returning to normal. 
>> >> 2024-11-19T05:31:57.096-0500: 39910.586: [GC (Allocation Failure) >> 2024-11-19T05:31:57.096-0500: 39910.587: [ParNew: >> 2276397K->23593K(2530368K), 0.8039922 secs] >> 34156897K->32412975K(157005312K), 0.8042477 secs] [Times: user=26.09 >> sys=0.28, real=0.81 secs] >> >> 2024-11-19T05:32:01.942-0500: 39915.432: [GC (Allocation Failure) >> 2024-11-19T05:32:01.942-0500: 39915.433: [ParNew: >> 2272873K->22588K(2530368K), 5.9797785 secs] >> 34662255K->32413062K(157005312K), 5.9800455 secs] [Times: user=195.82 >> sys=0.16, real=5.98 secs] >> >> 2024-11-19T05:32:08.811-0500: 39922.302: [GC (Allocation Failure) >> 2024-11-19T05:32:08.812-0500: 39922.302: [ParNew: >> 2271868K->24057K(2530368K), 30.6785198 secs] >> 34662342K->32417812K(157005312K), 30.6788152 secs] [Times: >> user=1005.22 sys=0.69, real=30.68 secs] >> >> 2024-11-19T05:32:40.245-0500: 39953.736: [GC (Allocation Failure) >> 2024-11-19T05:32:40.245-0500: 39953.736: [ParNew: >> 2273337K->41786K(2530368K), 48.4038572 secs] >> 34667092K->32436550K(157005312K), 48.4041764 secs] [Times: >> user=1586.51 sys=0.85, real=48.41 secs] >> >> 2024-11-19T05:33:32.236-0500: 40005.726: [GC (Allocation Failure) >> 2024-11-19T05:33:32.236-0500: 40005.726: [ParNew: >> 2291066K->59232K(2530368K), 63.8453443 secs] >> 34685830K->32454926K(157005312K), 63.8456820 secs] [Times: >> user=2091.55 sys=1.66, real=63.85 secs] >> >> 2024-11-19T05:34:36.698-0500: 40070.189: [GC (Allocation Failure) >> 2024-11-19T05:34:36.698-0500: 40070.189: [ParNew: >> 2308512K->74883K(2530368K), 78.2503259 secs] >> 34704206K->32471817K(157005312K), 78.2506265 secs] [Times: >> user=2565.48 sys=2.21, real=78.25 secs] >> >> 2024-11-19T05:35:55.417-0500: 40148.908: [GC (Allocation Failure) >> 2024-11-19T05:35:55.418-0500: 40148.908: [ParNew: >> 2324163K->102130K(2530368K), 96.4663544 secs] >> 34721097K->32500405K(157005312K), 96.4668756 secs] [Times: >> user=3161.99 sys=2.26, real=96.46 secs] >> >> 2024-11-19T05:37:32.503-0500: 40245.993: [GC (Allocation Failure) >> 2024-11-19T05:37:32.503-0500: 40245.993: [ParNew: >> 2351410K->108861K(2530368K), 122.1160632 secs] >> 34749685K->32508388K(157005312K), 122.1164558 secs] [Times: >> user=4001.43 sys=3.68, real=122.12 secs] >> >> 2024-11-19T05:39:35.187-0500: 40368.678: [GC (Allocation Failure) >> 2024-11-19T05:39:35.188-0500: 40368.678: [ParNew: >> 2358141K->144672K(2530368K), 161.7914697 secs] >> 34757668K->32546599K(157005312K), 161.7916831 secs] [Times: >> user=5302.53 sys=4.43, real=161.79 secs] >> >> 2024-11-19T05:42:17.358-0500: 40530.849: [GC (Allocation Failure) >> 2024-11-19T05:42:17.359-0500: 40530.849: [ParNew: >> 2393952K->200667K(2530368K), 279.7220815 secs] >> 34795879K->32612580K(157005312K), 279.7224630 secs] [Times: >> user=9164.75 sys=8.37, real=279.72 secs] >> >> 2024-11-19T05:46:57.492-0500: 40810.983: [GC (Allocation Failure) >> 2024-11-19T05:46:57.492-0500: 40810.983: [ParNew: >> 2449947K->250145K(2530368K), 0.2150744 secs] >> 34861860K->32690960K(157005312K), 0.2152389 secs] [Times: user=5.89 >> sys=0.70, real=0.22 secs] >> >> The GC log records above do not indicate that an excessive amount of >> GC-related work is being done.? Therefore, I?m guessing that most of >> the user CPU time is not spent doing productive work.? Instead, I?m >> guessing that the GC threads were just spinning. >> >> The CPU usage on the host was 68-69% during this period.? That makes >> sense based on the GC log records.? 
For example, the second to last >> GC log record reports 9164 CPU seconds and 279 real seconds, which >> yields a ratio of 32.85.? Divide 32.85 by the 48 CPUs and you get a >> usage of 68.4% ? which is the CPU usage that we observed on the host >> at this time.? That being the case, it appears that all of the >> observed CPU usage was consumed by the GC threads. >> >> Any idea what the GC threads might be doing that?s consuming so much >> CPU usage?? Or suggestions for how I might determine what the GC >> threads are doing? >> >> Thanks, >> >> Shane Cox >> >> First few lines of GC log containing configuration information: >> >> OpenJDK 64-Bit Server VM (25.212-b04) for linux-amd64 JRE >> (1.8.0_212-b04), built on Apr 19 2019 23:23:17 by "root" with gcc >> 7.3.1 20180303 (Red Hat 7.3.1-5) >> >> Memory: 4k page, physical 196374708k(167484808k free), swap >> 4194300k(3010032k free) >> >> CommandLine flags: -XX:CMSInitiatingOccupancyFraction=75 >> -XX:+ExplicitGCInvokesConcurrent -XX:+HeapDumpOnOutOfMemoryError >> -XX:InitialHeapSize=161061273600 -XX:MaxHeapSize=161061273600 >> -XX:MaxNewSize=2878971904 -XX:MaxTenuringThreshold=6 >> -XX:NewSize=2878971904 -XX:OldPLABSize=16 -XX:OldSize=5757943808 >> -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails >> -XX:+PrintGCTimeStamps -XX:+UseConcMarkSweepGC -XX:+UseParNewGC >> >> uname output: >> >> Linux 4.18.0-305.45.1.el8_4.x86_64 #1 SMP Wed Apr 6 13:48:37 EDT 2022 >> x86_64 x86_64 x86_64 GNU/Linux >> >> lscpu output: >> >> Architecture:??????? x86_64 >> >> CPU op-mode(s):????? 32-bit, 64-bit >> >> Byte Order:????????? Little Endian >> >> CPU(s):????????????? 48 >> >> On-line CPU(s) list: 0-47 >> >> Thread(s) per core:? 2 >> >> Core(s) per socket:? 12 >> >> Socket(s):?????????? 2 >> >> NUMA node(s):??????? 2 >> >> Vendor ID:?????????? GenuineIntel >> >> CPU family:????????? 6 >> >> Model:?????????????? 85 >> >> Model name:????????? Intel(R) Xeon(R) Gold 6246 CPU @ 3.30GHz >> >> Stepping:??????????? 7 >> >> CPU MHz:???????????? 4100.050 >> >> CPU max MHz:???????? 4200.0000 >> >> CPU min MHz:???????? 1200.0000 >> >> BogoMIPS:??????????? 6600.00 >> >> Virtualization:????? VT-x >> >> L1d cache:?????????? 32K >> >> L1i cache:?????????? 32K >> >> L2 cache:??????????? 1024K >> >> L3 cache:??????????? 25344K >> >> NUMA node0 CPU(s): >> 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46 >> >> NUMA node1 CPU(s): >> 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47 >> >> >> ------------------------------------------------------------------------ >> >> This message may contain confidential information and is intended for >> specific recipients unless explicitly noted otherwise. If you have >> reason to believe you are not an intended recipient of this message, >> please delete it and notify the sender. This message may not >> represent the opinion of Intercontinental Exchange, Inc. (ICE), its >> subsidiaries or affiliates, and does not constitute a contract or >> guarantee. Unencrypted electronic mail is not secure and the >> recipient of this message is expected to provide safeguards from >> viruses and pursue alternate means of communication where privacy or >> a binding message is desired. 
>> >> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.org
>> https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use
> > _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.org
> https://mail.openjdk.org/mailman/listinfo/hotspot-gc-use

From Shane.Cox at ice.com Fri Nov 22 21:05:09 2024
From: Shane.Cox at ice.com (Shane Cox)
Date: Fri, 22 Nov 2024 21:05:09 +0000
Subject: Minor GCs consuming an extreme amount of user CPU time
In-Reply-To: <5e6d33c2-4bdb-43ae-99a0-ca33e6f5ed33@oracle.com>
References: <5e6d33c2-4bdb-43ae-99a0-ca33e6f5ed33@oracle.com>
Message-ID: 

Thank you Thomas and Charlie for your replies.

We are able to recreate this behavior in a test environment. Per Thomas' suggestion, we enabled reference GC logging. Unfortunately that does not appear to be the issue. Here are example GC log records from our most recent test.

2024-11-22T15:14:44.859-0500: 6362.195: [GC (Allocation Failure) 2024-11-22T15:14:44.859-0500: 6362.195: [ParNew2024-11-22T15:14:51.706-0500: 6369.042: [SoftReference, 0 refs, 0.0000497 secs]2024-11-22T15:14:51.706-0500: 6369.042: [WeakReference, 0 refs, 0.0000160 secs]2024-11-22T15:14:51.706-0500: 6369.042: [FinalReference, 27 refs, 0.0000287 secs]2024-11-22T15:14:51.706-0500: 6369.042: [PhantomReference, 0 refs, 0 refs, 0.0000095 secs]2024-11-22T15:14:51.706-0500: 6369.042: [JNI Weak Reference, 0.0000146 secs]: 2314677K->69934K(2530368K), 6.8470053 secs] 17427466K->15692635K(157005312K), 6.8471532 secs] [Times: user=224.97 sys=0.14, real=6.85 secs]

2024-11-22T15:14:52.175-0500: 6369.511: [GC (Allocation Failure) 2024-11-22T15:14:52.175-0500: 6369.511: [ParNew2024-11-22T15:17:09.061-0500: 6506.397: [SoftReference, 0 refs, 0.0000562 secs]2024-11-22T15:17:09.061-0500: 6506.397: [WeakReference, 1 refs, 0.0000164 secs]2024-11-22T15:17:09.061-0500: 6506.397: [FinalReference, 67 refs, 0.0000326 secs]2024-11-22T15:17:09.061-0500: 6506.397: [PhantomReference, 0 refs, 0 refs, 0.0000094 secs]2024-11-22T15:17:09.061-0500: 6506.397: [JNI Weak Reference, 0.0000159 secs]: 2319214K->71352K(2530368K), 136.8858984 secs] 17941915K->15706880K(157005312K), 136.8860494 secs] [Times: user=4481.92 sys=17.19, real=136.88 secs]

2024-11-22T15:17:09.419-0500: 6506.755: [GC (Allocation Failure) 2024-11-22T15:17:09.419-0500: 6506.755: [ParNew2024-11-22T15:22:08.800-0500: 6806.137: [SoftReference, 0 refs, 0.0000570 secs]2024-11-22T15:22:08.800-0500: 6806.137: [WeakReference, 15 refs, 0.0000107 secs]2024-11-22T15:22:08.800-0500: 6806.137: [FinalReference, 931 refs, 0.0003795 secs]2024-11-22T15:22:08.801-0500: 6806.137: [PhantomReference, 0 refs, 0 refs, 0.0000108 secs]2024-11-22T15:22:08.801-0500: 6806.137: [JNI Weak Reference, 0.0000218 secs]: 2320632K->256239K(2530368K), 299.3821501 secs] 17956160K->15901644K(157005312K), 299.3823927 secs] [Times: user=9764.41 sys=80.57, real=299.38 secs]

Per Charlie's advice, I verified that transparent huge pages are disabled:

> cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Since we're able to recreate the problem, I verified that all of the GC GangWorker threads were 100% busy when the problem occurs. In addition, I ran the Linux perf command while the problem was occurring to gather more information. Sample output from a 30 second profiling session is below.

Samples: 999K of event 'cycles', Event count (approx.): 3877492515135
  Children      Self  Command  Shared Object       Symbol
+   99.26%     0.00%  java     libpthread-2.28.so  [.] start_thread
+   99.26%     0.00%  java     libjvm.so           [.] java_start
+   99.26%     0.00%  java     libjvm.so           [.] GangWorker::loop
+   99.26%     0.00%  java     libjvm.so           [.] ParNewGenTask::work
+   99.26%     0.00%  java     libjvm.so           [.] GenCollectedHeap::gen_process_roots
+   99.26%     0.00%  java     libjvm.so           [.] ConcurrentMarkSweepGeneration::younger_refs_iterate
+   99.26%     0.00%  java     libjvm.so           [.] CardTableRS::younger_refs_in_space_iterate
+   99.26%     0.00%  java     libjvm.so           [.] CardTableModRefBS::non_clean_card_iterate_possibly_parallel
+   99.26%     0.00%  java     libjvm.so           [.] CardTableModRefBS::non_clean_card_iterate_parallel_work
+   99.25%     0.01%  java     libjvm.so           [.] CardTableModRefBS::process_stride
+   86.82%    79.91%  java     libjvm.so           [.] CompactibleFreeListSpace::block_size
+   83.55%    18.61%  java     libjvm.so           [.] BlockOffsetArrayNonContigSpace::block_start_unsafe
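The exact perf invocation used for the profile above was not shown; a call-graph profile like this one can typically be captured with something along these lines (the PID is a placeholder):

perf record -g -p <java_pid> -- sleep 30
perf report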
Nearly 80% of the CPU usage was spent in the function CompactibleFreeListSpace::block_size. Most of the remaining time was spent in BlockOffsetArrayNonContigSpace::block_start_unsafe. I'm not familiar enough with the JVM code to understand what might be the source of the issue.

-----Original Message-----
From: hotspot-gc-use On Behalf Of charlie hunt
Sent: Friday, November 22, 2024 12:42 PM
To: hotspot-gc-use at openjdk.org
Subject: Re: Minor GCs consuming an extreme amount of user CPU time

WARNING - External email from openjdk.org

To add to comments from Thomas, and Paul Hohensee, the GC logs also tend to show a similar pattern we have seen with CMS at times with transparent huge pages being set to "always". The symptoms tended to be occasional high user CPU time along with some sys CPU time.

So another thing that may be worth looking at is setting the OS's transparent huge pages enabled to "never", and also setting transparent huge pages defrag to "never". Alternatively you could also try adding -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch to set of JVM command line args instead of mucking with the OS's transparent huge pages configuration.

Charlie

On 11/22/24 2:50 AM, Thomas Schatzl wrote:
> Hi,
>
> not having a lot of experience with CMS, the following could be
> potential issues here:
>
> * this is a burstable VM and it sporadically does not get cpu time.
> Does not seem so for such a large setup for me.
>
> * I am not sure about CMS (it's really been a long time) but I think
> we fixed some issue with large arrays of references not being well
> distributed across threads in Parallel GC (fairly recently), and CMS
> might have the same issue.
>
> Not sure this would show up as all threads being busy for a long time.
>
> Afair there is not much logging in ParNew/CMS either
> (-XX:+PrintGCDetails may help though).
>
> Maybe you know whether this can happen in your application.
>
> A heap histogram may also help during that time (but
>
> * maybe parallel reference processing not enabled. Use
> -XX:+PrintReferenceGC / -XX:+ParallelRefProcEnabled to enable.
>
> Maybe there is still somebody else out there more knowledgable about
> CMS tuning/issues than me.
>
> Hth,
> Thomas
>
> On 19.11.24 17:45, Shane Cox wrote:
>> We are observing a transient behavior where Minor GCs consume an
>> extreme amount of user CPU time. Below is an excerpt from a GC log
>> illustrating this behavior. The first line is representative of
>> typical behavior. Starting with the second line, we observe
>> progressively longer Minor GC
>> pause times and attendant user CPU
>> usage. The last line shows GC
>> behavior returning to normal.
From Shane.Cox at ice.com Mon Nov 25 19:41:38 2024 From: Shane.Cox at ice.com (Shane Cox) Date: Mon, 25 Nov 2024 19:41:38 +0000 Subject: Minor GCs consuming an extreme amount of user CPU time In-Reply-To: References: <5e6d33c2-4bdb-43ae-99a0-ca33e6f5ed33@oracle.com> Message-ID:

All 33 Gang Worker threads are executing CompactibleFreeListSpace::block_size for multiple consecutive minutes when this problem occurs. Looking at the source, this function executes a while loop, first checking whether the block is free. If it's not free (i.e., allocated), it then checks the contents of the block using the function klass_or_null. If klass_or_null returns NULL, the while loop is repeated.

I see that in Java 9 there was a requirement to eliminate ParNew's use of klass_or_null: https://bugs.openjdk.org/browse/JDK-8166229 This requirement related to a bug that Thomas worked on back in 2016: https://bugs.openjdk.org/browse/JDK-8160369

I'm wondering if we're encountering a case where an object header has been initialized, but the klass_or_null function does not have a clear view of the object and repeatedly returns NULL, causing the looping behavior that we observe. Does anyone know how I could determine if this is the cause?

-----Original Message----- From: Shane Cox Sent: Friday, November 22, 2024 4:05 PM To: hotspot-gc-use at openjdk.org Subject: RE: Minor GCs consuming an extreme amount of user CPU time

Thank you Thomas and Charlie for your replies. We are able to recreate this behavior in a test environment. Per Thomas' suggestion, we enabled reference GC logging. Unfortunately that does not appear to be the issue. Here are example GC log records from our most recent test.
2024-11-22T15:14:44.859-0500: 6362.195: [GC (Allocation Failure) 2024-11-22T15:14:44.859-0500: 6362.195: [ParNew2024-11-22T15:14:51.706-0500: 6369.042: [SoftReference, 0 refs, 0.0000497 secs]2024-11-22T15:14:51.706-0500: 6369.042: [WeakReference, 0 refs, 0.0000160 secs]2024-11-22T15:14:51.706-0500: 6369.042: [FinalReference, 27 refs, 0.0000287 secs]2024-11-22T15:14:51.706-0500: 6369.042: [PhantomReference, 0 refs, 0 refs, 0.0000095 secs]2024-11-22T15:14:51.706-0500: 6369.042: [JNI Weak Reference, 0.0000146 secs]: 2314677K->69934K(2530368K), 6.8470053 secs] 17427466K->15692635K(157005312K), 6.8471532 secs] [Times: user=224.97 sys=0.14, real=6.85 secs]

2024-11-22T15:14:52.175-0500: 6369.511: [GC (Allocation Failure) 2024-11-22T15:14:52.175-0500: 6369.511: [ParNew2024-11-22T15:17:09.061-0500: 6506.397: [SoftReference, 0 refs, 0.0000562 secs]2024-11-22T15:17:09.061-0500: 6506.397: [WeakReference, 1 refs, 0.0000164 secs]2024-11-22T15:17:09.061-0500: 6506.397: [FinalReference, 67 refs, 0.0000326 secs]2024-11-22T15:17:09.061-0500: 6506.397: [PhantomReference, 0 refs, 0 refs, 0.0000094 secs]2024-11-22T15:17:09.061-0500: 6506.397: [JNI Weak Reference, 0.0000159 secs]: 2319214K->71352K(2530368K), 136.8858984 secs] 17941915K->15706880K(157005312K), 136.8860494 secs] [Times: user=4481.92 sys=17.19, real=136.88 secs]

2024-11-22T15:17:09.419-0500: 6506.755: [GC (Allocation Failure) 2024-11-22T15:17:09.419-0500: 6506.755: [ParNew2024-11-22T15:22:08.800-0500: 6806.137: [SoftReference, 0 refs, 0.0000570 secs]2024-11-22T15:22:08.800-0500: 6806.137: [WeakReference, 15 refs, 0.0000107 secs]2024-11-22T15:22:08.800-0500: 6806.137: [FinalReference, 931 refs, 0.0003795 secs]2024-11-22T15:22:08.801-0500: 6806.137: [PhantomReference, 0 refs, 0 refs, 0.0000108 secs]2024-11-22T15:22:08.801-0500: 6806.137: [JNI Weak Reference, 0.0000218 secs]: 2320632K->256239K(2530368K), 299.3821501 secs] 17956160K->15901644K(157005312K), 299.3823927 secs] [Times: user=9764.41 sys=80.57, real=299.38 secs]

Per Charlie's advice, I verified that transparent huge pages are disabled:

> cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Since we're able to recreate the problem, I verified that all of the GC GangWorker threads were 100% busy when the problem occurs. In addition, I ran the Linux perf command while the problem was occurring to gather more information. Sample output from a 30 second profiling session is below.

Samples: 999K of event 'cycles', Event count (approx.): 3877492515135
Children Self Command Shared Object Symbol
+ 99.26% 0.00% java libpthread-2.28.so [.] start_thread
+ 99.26% 0.00% java libjvm.so [.] java_start
+ 99.26% 0.00% java libjvm.so [.] GangWorker::loop
+ 99.26% 0.00% java libjvm.so [.] ParNewGenTask::work
+ 99.26% 0.00% java libjvm.so [.] GenCollectedHeap::gen_process_roots
+ 99.26% 0.00% java libjvm.so [.] ConcurrentMarkSweepGeneration::younger_refs_iterate
+ 99.26% 0.00% java libjvm.so [.] CardTableRS::younger_refs_in_space_iterate
+ 99.26% 0.00% java libjvm.so [.] CardTableModRefBS::non_clean_card_iterate_possibly_parallel
+ 99.26% 0.00% java libjvm.so [.] CardTableModRefBS::non_clean_card_iterate_parallel_work
+ 99.25% 0.01% java libjvm.so [.] CardTableModRefBS::process_stride
+ 86.82% 79.91% java libjvm.so [.] CompactibleFreeListSpace::block_size
+ 83.55% 18.61% java libjvm.so [.] BlockOffsetArrayNonContigSpace::block_start_unsafe

Nearly 80% of the CPU usage was spent in the function CompactibleFreeListSpace::block_size.
Most of the remaining time was spent in BlockOffsetArrayNonContigSpace::block_start_unsafe. I'm not familiar enough with the JVM code to understand what might be the source of the issue.
From Shane.Cox at ice.com Wed Nov 27 17:36:36 2024 From: Shane.Cox at ice.com (Shane Cox) Date: Wed, 27 Nov 2024 17:36:36 +0000 Subject: Minor GCs consuming an extreme amount of user CPU time In-Reply-To: References: <5e6d33c2-4bdb-43ae-99a0-ca33e6f5ed33@oracle.com> Message-ID:

We identified the trigger of this behavior. Due to our own extremely inefficient use of BitSet, the JVM was allocating a 500MB long[]. This array would not fit in the survivor spaces, so it was immediately moved to the Old Generation. The subsequent Minor GC then runs for 1-5 minutes with all 33 Gang Worker threads spinning, executing CompactibleFreeListSpace::block_size.

Our first experiment was to comment out the code that caused the allocation of the 500MB long[]. Without this code, the problematic behavior no longer occurred.

Our second experiment was to keep the code that allocated the 500MB long[] but increase the Young Generation size to 20 GB. With this much larger Young Generation, we observed that the 500MB long[] remained in the survivor spaces for multiple Minor GCs before being promoted to the Old Generation. The problematic behavior did not occur.

In conclusion, the evidence suggests that the trigger of this problematic behavior is the allocation of a very large array that cannot fit in the survivor spaces and must therefore be moved directly from Eden to the Old Generation. Somehow CompactibleFreeListSpace::block_size does not get a clear view of one or more objects, which could be related to https://bugs.openjdk.org/browse/JDK-8160369.

Thank you again to everyone who replied.

Shane
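The application code behind the trigger is not shown in the thread, but a single very high bit index is enough to make java.util.BitSet allocate a backing long[] of roughly this size, since the backing array needs one long for every 64 bits up to the highest set index. Below is a minimal, hypothetical sketch of that kind of allocation; the bit index, the class name, and the survivor-pool name it looks for are illustrative assumptions, not details taken from the application in this thread.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.BitSet;

// Hypothetical sketch: setting one very high bit forces BitSet to allocate a
// very large long[] backing array. Running it with the ParNew/CMS flags from
// the GC log should show the backing array being promoted on a minor GC.
public class BigBitSetSketch {
    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(2_000_000_000);              // about 31 million longs, roughly 250 MB

        long backingBytes = bits.size() / 8L; // size() = bits covered by the backing array
        System.out.println("BitSet backing long[] ~" + (backingBytes / 1_000_000) + " MB");

        // Compare with the survivor-space capacity reported by the JVM; the pool
        // name containing "Survivor" is what CMS/ParNew typically reports.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getName().contains("Survivor") && usage != null && usage.getMax() > 0) {
                System.out.println(pool.getName() + " max ~" + (usage.getMax() / 1_000_000) + " MB");
            }
        }
    }
}

For scale, assuming the default -XX:SurvivorRatio=8 (it is not overridden on the command line posted earlier in the thread), -XX:MaxNewSize=2878971904 (about 2746 MiB) gives survivor spaces of only about 275 MiB each, which is consistent with the ParNew capacity of 2530368K in the logs (eden plus one survivor). An array of several hundred megabytes therefore can never be retained in a survivor space and is promoted by the first Minor GC that copies it.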