ZGC and -XX:+AlwaysPreTouch performance

Evaristo José Camarero evaristojosec at yahoo.es
Wed Aug 14 06:12:17 UTC 2024


 Hi Stefan,
Thanks for the detailed explanations. I have a much better understanding of the situation now. I will also look into configuring a static pool of large pages as you suggest.
Regards,
Evaristo
    On Tuesday, 13 August 2024 at 19:07:39 CEST, Stefan Johansson <stefan.johansson at oracle.com> wrote:  
 
 Hi Evaristo,

There are a few things to keep in mind here. Comments below.

On 2024-08-12 19:37, Evaristo José Camarero wrote:
> Hi,
> 
> I am interested in using -XX:+AlwaysPreTouch and I was checking the 
> startup delay it adds. I am also using TransparentHugePages
> 
> THP config
> [cheva-virtualmachine ~]# cat 
> /sys/kernel/mm/transparent_hugepage/shmem_enabled
> always within_size [advise] never deny force
> [cheva-virtualmachine ~]# cat /sys/kernel/mm/transparent_hugepage/enabled
> [always] madvise never
> 
> 

Thanks for providing this config and also checking shmem_enabled. As you 
probably know, ZGC backs the heap with shared memory while G1 does not. 
With the above config, ZGC will only use transparent huge pages when they 
are enabled on the command line, while G1 will always get transparent 
huge pages as long as the memory is correctly aligned (even with 
-XX:-UseTransparentHugePages).

So the command lines below aren't a fair comparison, since the G1 heap 
will be backed by THP but the ZGC heap will not.
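
For example (just a sketch based on your config above), with shmem_enabled 
at [advise] the ZGC heap would only get THP backing if the flag is enabled 
on the command line, i.e. something like:

$ java -Xmx32G -Xms32G -XX:+UseZGC -XX:+ZGenerational \
    -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch -version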

> 
> I am running a VM with 8 cores and I observed that PreTouch is much 
> faster with G1 than with Gen ZGC. The main reason is that G1 uses 8 
> concurrent threads for the job while Gen ZGC was using only 2 threads 
> (I used top -d 1 and watched the busy threads there).
> 
> 
> I made this test with a 32G heap, BUT the production environment is 
> running with 300G, so I expect the difference to be even larger.
> 
> # G1 - Using 8 cores (GC Thread #0 ..#8)
> $> time java -Xmx32G -Xms32G  -XX:-UseTransparentHugePages 
> -XX:+AlwaysPreTouch -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed 
> mode, sharing)
> java -Xmx32G -Xms32G -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch 
> -versio  0,44s user 12,28s system 688% cpu 1,848 total
> 
> 
> #Gen ZGC - Using 1 thread and at some point switching to 2 threads 
> (ZGCWorker#0 and #1)
> $> time java -Xmx34G -Xms34G -XX:+UseZGC -XX:+ZGenerational 
> -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch -version
> openjdk version "21.0.4" 2024-07-16 LTS
> OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)
> OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed 
> mode, sharing)
> java -Xmx34G -Xms34G -XX:+UseZGC -XX:+ZGenerational  
> -XX:+AlwaysPreTouch   1,08s user 11,92s system 136% cpu 9,530 total
> 
> 

I reran this locally and set -XX:ActiveProcessorCount=8 to limit my 
cores a bit. I also added some logging (but cut away some parts) to 
better understand and see the differences between G1 and ZGC. It is not 
only pre-touch that causes the difference in startup time (which can be 
seen by removing -XX:+AlwaysPreTouch from your command lines); the 
slower startup also comes from how the ZGC heap is set up with shared memory.

G1
---
$ time jdk-21/bin/java -Xmx32G -Xms32G -XX:-UseTransparentHugePages 
-XX:+AlwaysPreTouch -XX:ActiveProcessorCount=8 
-Xlog:gc+heap*=debug,gc+init -version
[0.005s][debug][gc,heap] Minimum heap 34359738368  Initial heap 
34359738368  Maximum heap 34359738368
[0.006s][debug][gc,heap] Running G1 PreTouch with 8 workers for 8192 
work units pre-touching 34359738368B.
[0.772s][debug][gc,heap] Running G1 PreTouch with 8 workers for 128 work 
units pre-touching 536870912B.
[0.785s][debug][gc,heap] Running G1 PreTouch with 8 workers for 16 work 
units pre-touching 67108864B.
[0.787s][debug][gc,heap] Running G1 PreTouch with 8 workers for 16 work 
units pre-touching 67108864B.
[0.799s][info ][gc,init] Version: 21+35-LTS-2513 (release)
[0.799s][info ][gc,init] Parallel Workers: 8
...
java version "21" 2023-09-19 LTS
Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, 
sharing)

real    0m0.901s
user    0m0.185s
sys    0m6.163s
---

ZGC
---
$ time jdk-21/bin/java -Xmx32G -Xms32G -XX:-UseTransparentHugePages 
-XX:+AlwaysPreTouch -XX:ActiveProcessorCount=8 
-Xlog:gc+task*=debug,gc+init -XX:+UseZGC -XX:+ZGenerational -version
[0.006s][info][gc,init] Initializing The Z Garbage Collector
...
[0.007s][info][gc,init] GC Workers for Old Generation: 2 (dynamic)
[0.007s][info][gc,init] GC Workers for Young Generation: 2 (dynamic)
[3.983s][debug][gc,task] Executing ZPreTouchTask using ZWorkerOld with 2 
workers
[10.886s][info ][gc,init] GC Workers Max: 2 (dynamic)
[10.887s][info ][gc,init] Runtime Workers: 5
java version "21" 2023-09-19 LTS
Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, 
sharing)

real    0m14.218s
user    0m1.387s
sys    0m19.690s
---

Above we can see that G1 spent ~765ms pre-touching the heap using 8 
threads. In the ZGC case the actual pre-touching doesn't start until 
after ~4s; the time spent before that is just setting up the heap. We 
then see ZGC spending almost 7s on pre-touching using only 2 threads. 
This could be sped up by using more workers and we might want to look 
more at this. There are other features and plans in related areas, and 
in JDK 23 the pre-touch implementation has changed, so many things are 
moving in this area.
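
If you want to experiment with that, one thing to try (just a sketch, I 
have not verified how much it helps the pre-touch task) is raising the 
concurrent GC worker cap and checking in the gc+task log whether 
ZPreTouchTask picks up more workers:

$ time java -Xmx32G -Xms32G -XX:+UseZGC -XX:+ZGenerational \
    -XX:ConcGCThreads=8 -XX:+AlwaysPreTouch \
    -Xlog:gc+task*=debug,gc+init -version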

But to make the comparison fairer, we should also run ZGC with 
shmem_enabled set to always.
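
For example, using the same sysfs knob you listed above (needs root):

# echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled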

ZGC (shmem_enabled = always)
----------------------------
$ time jdk-21/bin/java -Xmx32G -Xms32G -XX:-UseTransparentHugePages 
-XX:+AlwaysPreTouch -XX:ActiveProcessorCount=8 
-Xlog:gc+task*=debug,gc+init -XX:+UseZGC -XX:+ZGenerational -version
[0.006s][info][gc,init] Initializing The Z Garbage Collector
[0.006s][info][gc,init] Version: 21+35-LTS-2513 (release)
...
[0.006s][info][gc,init] Heap Backing File: /memfd:java_heap
[0.006s][info][gc,init] Heap Backing Filesystem: tmpfs (0x1021994)
...
[5.488s][debug][gc,task] Executing ZPreTouchTask using ZWorkerOld with 2 
workers
[5.675s][info ][gc,init] GC Workers Max: 2 (dynamic)
[5.676s][info ][gc,init] Runtime Workers: 5
java version "21" 2023-09-19 LTS
Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, 
sharing)

real    0m5.938s
user    0m0.289s
sys    0m5.833s
---

So an even longer time to set up the heap, but the actual pre-touching 
is very quick, roughly 200ms using just 2 workers (which looks a bit 
strange to me). So the main difference isn't really the pre-touch time 
but the cost of setting up the heap with shared memory. To avoid this 
cost it is possible to use explicit large pages (HugeTLBFS) instead.
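
A rough sketch of what that could look like (assuming 2MB huge pages and 
a 32G heap, so at least 16384 pages need to be reserved; the exact count 
and setup depend on your system):

# echo 17000 > /proc/sys/vm/nr_hugepages
$ java -Xmx32G -Xms32G -XX:+UseZGC -XX:+ZGenerational \
    -XX:+UseLargePages -XX:+AlwaysPreTouch -version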

I hope this helps give a better understanding of where the time goes. 
When it comes to pre-touching, we do know that ZGC uses fewer threads 
than G1, and this might be something to look at going forward.

Thanks,
StefanJ


> Non generational ZGC is even slower.
> 
> 
> In this case, GenZGC is 5 times slower than G1 and it is NOT using all 
> available cores to do the job.
> 
> Is this the expected behaviour? Could it be optimized, or is there a 
> reason to avoid using more threads?
> 
> Thanks in advance,
> 
> Evaristo
> 
  