<html><head></head><body><div class="ydpca63eda4yahoo-style-wrap" style="font-family:Helvetica Neue, Helvetica, Arial, sans-serif;font-size:13px;"><div></div>
<div dir="ltr" data-setdir="false">Hi Stefan,</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Thanks for the detailed explanations. I have a much better understanding of the situation now. I will also check configuring a static pool of large pages as you suggested.</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Regards,</div><div dir="ltr" data-setdir="false"><br></div><div dir="ltr" data-setdir="false">Evaristo</div><div><br></div>
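For reference, a minimal sketch of how a static huge page pool could be set up for the HugeTLBFS route Stefan mentioned. The page count here is my own sizing assumption (16384 x 2 MB pages = 32 GB, matching the test heap), not something from this thread:

```shell
# Reserve a static pool of 2 MB huge pages sized for a 32G heap
# (16384 * 2 MB = 32 GB); the count and the use of sudo are assumptions.
echo 16384 | sudo tee /proc/sys/vm/nr_hugepages

# Verify the pool was actually reserved
grep HugePages_Total /proc/meminfo

# Start the JVM with explicit large pages (HugeTLBFS) instead of THP
java -Xmx32G -Xms32G -XX:+UseLargePages -XX:+AlwaysPreTouch -version
```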
</div><div id="yahoo_quoted_4146736101" class="yahoo_quoted">
<div style="font-family:'Helvetica Neue', Helvetica, Arial, sans-serif;font-size:13px;color:#26282a;">
<div>
On Tuesday, August 13, 2024, 19:07:39 CEST, Stefan Johansson <stefan.johansson@oracle.com> wrote:
</div>
<div><br></div>
<div><br></div>
<div><div dir="ltr">Hi Evaristo,<br clear="none"><br clear="none">There are a few things to keep in mind here. Comments below.<br clear="none"><br clear="none">On 2024-08-12 19:37, Evaristo José Camarero wrote:<br clear="none">> Hi,<br clear="none">> <br clear="none">> I am interested in using -XX:+AlwaysPreTouch and I was checking the <br clear="none">> delay during booting. I am also using TransparentHugePages<br clear="none">> <br clear="none">> THP config<br clear="none">> [cheva-virtualmachine ~]# cat <br clear="none">> /sys/kernel/mm/transparent_hugepage/shmem_enabled<br clear="none">> always within_size [advise] never deny force<br clear="none">> [cheva-virtualmachine ~]# cat /sys/kernel/mm/transparent_hugepage/enabled<br clear="none">> [always] madvise never<br clear="none">> <br clear="none">> <br clear="none"><br clear="none">Thanks for providing this config and also checking shmem_enabled. As you <br clear="none">probably know, ZGC relies on shared memory while G1 does not. So the above <br clear="none">config will allow ZGC to use transparent huge pages only when enabled on the <br clear="none">command-line, while G1 will always get transparent huge pages if the <br clear="none">memory is aligned correctly (even with -XX:-UseTransparentHugePages).<br clear="none"><br clear="none">So the command-lines below won't be a fair comparison, since the G1 heap will be <br clear="none">backed with THP but not the ZGC heap.<br clear="none"><br clear="none">> <br clear="none">> I am running a VM with 8 cores and I observed that PreTouch is much <br clear="none">> faster with G1 than with Gen ZGC. 
The main reason is that I could see that <br clear="none">> G1 is using 8 concurrent threads for doing the job while Gen ZGC was <br clear="none">> using 2 threads (I used top -d 1 and observed the busy threads there)<br clear="none">> <br clear="none">> <br clear="none">> I made this test with a 32G heap, BUT the production environment is <br clear="none">> running with 300G, so I expect the difference to be even larger.<br clear="none">> <br clear="none">> # G1 - Using 8 cores (GC Thread #0 ..#8)<br clear="none">> $> time java -Xmx32G -Xms32G -XX:-UseTransparentHugePages <br clear="none">> -XX:+AlwaysPreTouch -version<br clear="none">> openjdk version "21.0.4" 2024-07-16 LTS<br clear="none">> OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)<br clear="none">> OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed <br clear="none">> mode, sharing)<br clear="none">> java -Xmx32G -Xms32G -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch <br clear="none">> -versio 0,44s user 12,28s system 688% cpu 1,848 total<br clear="none">> <br clear="none">> <br clear="none">> #Gen ZGC - Using 1 thread and at some point switching to 2 threads <br clear="none">> (ZGCWorker#0 and #1)<br clear="none">> $> time java -Xmx34G -Xms34G -XX:+UseZGC -XX:+ZGenerational <br clear="none">> -XX:-UseTransparentHugePages -XX:+AlwaysPreTouch -version <br clear="none">> ✔<br clear="none">> openjdk version "21.0.4" 2024-07-16 LTS<br clear="none">> OpenJDK Runtime Environment Zulu21.36+17-CA (build 21.0.4+7-LTS)<br clear="none">> OpenJDK 64-Bit Server VM Zulu21.36+17-CA (build 21.0.4+7-LTS, mixed <br clear="none">> mode, sharing)<br clear="none">> java -Xmx34G -Xms34G -XX:+UseZGC -XX:+ZGenerational <br clear="none">> -XX:+AlwaysPreTouch 1,08s user 11,92s system 136% cpu 9,530 total<br clear="none">> <br clear="none">> <br clear="none"><br clear="none">I reran this locally and set the "ActiveProcessorCount" to 8 to limit my <br clear="none">cores a bit. 
I also added some logging (but cut away some parts) to <br clear="none">better understand and see the differences between G1 and ZGC. This is because it <br clear="none">is not only pre-touch that causes the difference in startup time (which <br clear="none">can be seen by removing -XX:+AlwaysPreTouch from your command-lines). The <br clear="none">slower startup also comes from how the ZGC heap is set up with shared memory.<br clear="none"><br clear="none">G1<br clear="none">---<br clear="none">$ time jdk-21/bin/java -Xmx32G -Xms32G -XX:-UseTransparentHugePages <br clear="none">-XX:+AlwaysPreTouch -XX:ActiveProcessorCount=8 <br clear="none">-Xlog:gc+heap*=debug,gc+init -version<br clear="none">[0.005s][debug][gc,heap] Minimum heap 34359738368 Initial heap <br clear="none">34359738368 Maximum heap 34359738368<br clear="none">[0.006s][debug][gc,heap] Running G1 PreTouch with 8 workers for 8192 <br clear="none">work units pre-touching 34359738368B.<br clear="none">[0.772s][debug][gc,heap] Running G1 PreTouch with 8 workers for 128 work <br clear="none">units pre-touching 536870912B.<br clear="none">[0.785s][debug][gc,heap] Running G1 PreTouch with 8 workers for 16 work <br clear="none">units pre-touching 67108864B.<br clear="none">[0.787s][debug][gc,heap] Running G1 PreTouch with 8 workers for 16 work <br clear="none">units pre-touching 67108864B.<br clear="none">[0.799s][info ][gc,init] Version: 21+35-LTS-2513 (release)<br clear="none">[0.799s][info ][gc,init] Parallel Workers: 8<br clear="none">...<br clear="none">java version "21" 2023-09-19 LTS<br clear="none">Java(TM) SE Runtime Environment (build 21+35-LTS-2513)<br clear="none">Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, <br clear="none">sharing)<br clear="none"><br clear="none">real 0m0.901s<br clear="none">user 0m0.185s<br clear="none">sys 0m6.163s<br clear="none">---<br clear="none"><br clear="none">ZGC<br clear="none">---<br clear="none">$ time jdk-21/bin/java -Xmx32G -Xms32G 
-XX:-UseTransparentHugePages <br clear="none">-XX:+AlwaysPreTouch -XX:ActiveProcessorCount=8 <br clear="none">-Xlog:gc+task*=debug,gc+init -XX:+UseZGC -XX:+ZGenerational -version<br clear="none">[0.006s][info][gc,init] Initializing The Z Garbage Collector<br clear="none">...<br clear="none">[0.007s][info][gc,init] GC Workers for Old Generation: 2 (dynamic)<br clear="none">[0.007s][info][gc,init] GC Workers for Young Generation: 2 (dynamic)<br clear="none">[3.983s][debug][gc,task] Executing ZPreTouchTask using ZWorkerOld with 2 <br clear="none">workers<br clear="none">[10.886s][info ][gc,init] GC Workers Max: 2 (dynamic)<br clear="none">[10.887s][info ][gc,init] Runtime Workers: 5<br clear="none">java version "21" 2023-09-19 LTS<br clear="none">Java(TM) SE Runtime Environment (build 21+35-LTS-2513)<br clear="none">Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, <br clear="none">sharing)<br clear="none"><br clear="none">real 0m14.218s<br clear="none">user 0m1.387s<br clear="none">sys 0m19.690s<br clear="none">---<br clear="none"><br clear="none">Above we can see that G1 spent ~765ms pre-touching the heap using 8 <br clear="none">threads. In the ZGC case we can see that the actual pre-touching doesn't <br clear="none">start until after ~4s. The time spent before that is just setting up the <br clear="none">heap. We then see ZGC spending almost 7s on pre-touching using only 2 <br clear="none">threads. This can be sped up by using more workers and we might want to <br clear="none">look more into this. There are other features and plans in related areas, <br clear="none">and in JDK 23 the pre-touch implementation has changed. 
So many things <br clear="none">are moving in this area.<br clear="none"><br clear="none">But to make the comparison fairer, we should also run ZGC with THP set to <br clear="none">always for shmem_enabled.<br clear="none"><br clear="none">ZGC (shmem_enabled = always)<br clear="none">----------------------------<br clear="none">$ time jdk-21/bin/java -Xmx32G -Xms32G -XX:-UseTransparentHugePages <br clear="none">-XX:+AlwaysPreTouch -XX:ActiveProcessorCount=8 <br clear="none">-Xlog:gc+task*=debug,gc+init -XX:+UseZGC -XX:+ZGenerational -version<br clear="none">[0.006s][info][gc,init] Initializing The Z Garbage Collector<br clear="none">[0.006s][info][gc,init] Version: 21+35-LTS-2513 (release)<br clear="none">...<br clear="none">[0.006s][info][gc,init] Heap Backing File: /memfd:java_heap<br clear="none">[0.006s][info][gc,init] Heap Backing Filesystem: tmpfs (0x1021994)<br clear="none">...<br clear="none">[5.488s][debug][gc,task] Executing ZPreTouchTask using ZWorkerOld with 2 <br clear="none">workers<br clear="none">[5.675s][info ][gc,init] GC Workers Max: 2 (dynamic)<br clear="none">[5.676s][info ][gc,init] Runtime Workers: 5<br clear="none">java version "21" 2023-09-19 LTS<br clear="none">Java(TM) SE Runtime Environment (build 21+35-LTS-2513)<br clear="none">Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, <br clear="none">sharing)<br clear="none"><br clear="none">real 0m5.938s<br clear="none">user 0m0.289s<br clear="none">sys 0m5.833s<br clear="none">---<br clear="none"><br clear="none">So it takes even longer to set up the heap, but the actual pre-touching is <br clear="none">very quick, roughly 200ms using just 2 workers (which looks a bit <br clear="none">strange to me). So the main difference isn't really the pre-touch time <br clear="none">but the cost of setting up the heap with shared memory. 
To avoid this <br clear="none">cost it is possible to use explicit large pages (HugeTLBFS) instead.<br clear="none"><br clear="none">I hope this helps you get a better understanding of what is taking time. <br clear="none">When it comes to pre-touching, we do know that ZGC is using fewer threads <br clear="none">compared to G1, and this might be something to look at going forward.<br clear="none"><br clear="none">Thanks,<br clear="none">StefanJ<div class="yqt9881519212" id="yqtfd49403"><br clear="none"><br clear="none"><br clear="none">> Non-generational ZGC is even slower.<br clear="none">> <br clear="none">> <br clear="none">> In this case, Gen ZGC is 5 times slower than G1 and it is NOT using all <br clear="none">> available cores to do the job.<br clear="none">> <br clear="none">> Is this somehow expected behaviour? Maybe it could be optimized, or is <br clear="none">> there any reason to avoid using more threads?<br clear="none">> <br clear="none">> Thanks in advance,<br clear="none">> <br clear="none">> Evaristo<br clear="none">> <br clear="none"></div></div></div>
</div>
</div></body></html>