Generational ZGC issue

Stefan Johansson stefan.johansson at oracle.com
Fri Feb 16 12:43:10 UTC 2024


Hi Johannes,

We've spent some more time looking at this, and getting the JSON file to 
reproduce it made it easy to verify our suspicions. Thanks for 
uploading it.

There are a few things playing together here. The test makes quite 
heavy use of DirectByteBuffers, and you limit their usage to 2G 
(-XX:MaxDirectMemorySize=2g). The life cycle and freeing of the native 
memory part of a DirectByteBuffer rely on reference processing. In 
generational ZGC, reference processing is only done during Major 
collections, and since the general GC pressure in this benchmark is very 
low (most objects die young), we do not trigger that many Major 
collections.
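As a minimal standalone sketch of that life cycle (my own illustration, not code from the benchmark): a DirectByteBuffer's off-heap memory is only returned once the GC's reference processing has run for the dead buffer, so a loop like this stays within the direct-memory limit only if the collector keeps up.

```java
import java.nio.ByteBuffer;

// Sketch: each DirectByteBuffer reserves off-heap memory that is freed
// by a Cleaner, which only runs after GC reference processing has seen
// that the buffer object is unreachable.
public class DirectBufferLifecycle {
    public static void main(String[] args) {
        for (int i = 0; i < 100; i++) {
            ByteBuffer buf = ByteBuffer.allocateDirect(1024 * 1024); // 1 MiB off-heap
            buf.putInt(0, i); // use it briefly
            // buf becomes unreachable here; its native memory is freed
            // only once a GC cycle processes the associated reference.
        }
        System.out.println("allocated and dropped 100 direct buffers");
    }
}
```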

Normally this would not be a problem. To avoid throwing an out of memory 
error (due to hitting the direct buffer memory limit) too early, the JDK 
triggers a System.gc(). This should trigger reference processing, and all 
buffers that are no longer in use would be freed. Since you specify the 
option -XX:+DisableExplicitGC, all these calls to trigger GCs are ignored 
and no direct memory will be freed. So in our testing, just removing 
this flag makes the test pass.
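A simplified sketch of that retry pattern (hypothetical, loosely modeled on java.nio.Bits.reserveMemory; the real implementation also drains pending Cleaners and retries with backoff before giving up):

```java
import java.nio.ByteBuffer;

// Sketch of what the JDK does when the direct-memory limit is hit:
// trigger System.gc() so reference processing can free dead buffers,
// then retry. Under -XX:+DisableExplicitGC the System.gc() call is a
// no-op, so the retry fails and an OutOfMemoryError is thrown.
public class DirectMemoryRetry {
    static ByteBuffer allocateWithRetry(int bytes) {
        try {
            return ByteBuffer.allocateDirect(bytes);
        } catch (OutOfMemoryError e) {
            System.gc(); // ignored when -XX:+DisableExplicitGC is set
            try {
                Thread.sleep(100); // give reference processing a chance
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
            return ByteBuffer.allocateDirect(bytes); // second attempt
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = allocateWithRetry(4096);
        System.out.println("reserved " + buf.capacity() + " bytes of direct memory");
    }
}
```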

Another solution is to use HeapByteBuffers instead and not have to 
worry about the direct memory usage at all. The OpenHFT lib seems to 
support this by just using elasticHeapByteBuffer(...) instead 
of elasticByteBuffer().
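For comparison, here are the two buffer kinds at the plain NIO level (a small sketch of mine): heap buffers are backed by a byte[] on the Java heap, so the GC reclaims them like any other object, with no reference processing or MaxDirectMemorySize accounting involved.

```java
import java.nio.ByteBuffer;

// Heap buffers live on the Java heap (backed by a byte[]);
// direct buffers reserve native memory outside the heap.
public class HeapVsDirect {
    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(4096);        // byte[]-backed
        ByteBuffer direct = ByteBuffer.allocateDirect(4096); // off-heap
        System.out.println("heap isDirect=" + heap.isDirect()
                + " hasArray=" + heap.hasArray());
        System.out.println("direct isDirect=" + direct.isDirect());
    }
}
```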

Lastly, the reason this works with non-generational ZGC is that it 
does reference processing on every GC.

Hope this helps,
StefanJ


On 2024-02-15 21:53, Johannes Lichtenberger wrote:
> It's a laptop, I've attached some details.
> 
> Furthermore, if it seems worth digging deeper into the issue, the JSON 
> file is here for one week: 
> https://www.transfernow.net/dl/20240215j9NaPTc0 
> <https://www.transfernow.net/dl/20240215j9NaPTc0>
> 
> You'd have to unzip into bundles/sirix-core/src/test/resources/json, 
> remove the @Disabled annotation and run the test 
> JsonShredderTest::testChicagoDescendantAxis
> 
> The test JVM parameters are specified in the parent build.gradle in the 
> project root folder.
> 
> The GitHub repo: https://github.com/sirixdb/sirix 
> <https://github.com/sirixdb/sirix>
> 
> Screenshot from 2024-02-15 21-43-33.png
> 
> kind regards
> Johannes
> 
> On Thu, Feb 15, 2024 at 8:01 PM Peter Booth 
> <peter_booth at me.com <mailto:peter_booth at me.com>> wrote:
> 
>     Just curious - what CPU, physical memory and OS are you using?
>     Sent from my iPhone
> 
>>     On Feb 15, 2024, at 12:23 PM, Johannes Lichtenberger
>>     <lichtenberger.johannes at gmail.com
>>     <mailto:lichtenberger.johannes at gmail.com>> wrote:
>>
>>     
>>     I guess I don't know which JDK it picks for the tests, but I guess
>>     OpenJDK
>>
>>     On Thu, Feb 15, 2024, 17:58 Johannes Lichtenberger
>>     <lichtenberger.johannes at gmail.com
>>     <mailto:lichtenberger.johannes at gmail.com>> wrote:
>>
>>         However, it's the same with: ./gradlew
>>         -Dorg.gradle.java.home=/home/johannes/.jdks/openjdk-21.0.2
>>         :sirix-core:test --tests
>>         io.sirix.service.json.shredder.JsonShredderTest.testChicagoDescendantAxis   using OpenJDK hopefully
>>
>>         On Thu, Feb 15, 2024 at 5:54 PM Johannes
>>         Lichtenberger <lichtenberger.johannes at gmail.com
>>         <mailto:lichtenberger.johannes at gmail.com>> wrote:
>>
>>             I've attached two logs, the first one without
>>             -XX:+Generational, the second one with the option set,
>>             even though I also saw, that generational ZGC is going to
>>             be supported in GraalVM 24.1 in September... so not sure
>>             what this does :)
>>
>>             On Thu, Feb 15, 2024 at 5:52 PM Johannes
>>             Lichtenberger <lichtenberger.johannes at gmail.com
>>             <mailto:lichtenberger.johannes at gmail.com>> wrote:
>>
>>                 Strange, so does it simply ignore the option? The
>>                 following is the beginning of the output from _non_
>>                 generational ZGC:
>>
>>                 johannes at luna:~/IdeaProjects/sirix$ ./gradlew
>>                 -Dorg.gradle.java.home=/home/johannes/.sdkman/candidates/java/21.0.2-graal :sirix-core:test --tests io.sirix.service.json.shredder.JsonShredderTest.testChicagoDescendantAxis
>>
>>                 > Configure project :
>>                 The 'sonarqube' task depends on compile tasks. This
>>                 behavior is now deprecated and will be removed in
>>                 version 5.x. To avoid implicit compilation, set
>>                 property 'sonar.gradle.skipCompile' to 'true' and make
>>                 sure your project is compiled, before analysis has
>>                 started.
>>                 The 'sonar' task depends on compile tasks. This
>>                 behavior is now deprecated and will be removed in
>>                 version 5.x. To avoid implicit compilation, set
>>                 property 'sonar.gradle.skipCompile' to 'true' and make
>>                 sure your project is compiled, before analysis has
>>                 started.
>>                 [1,627s][info   ][gc      ] GC(0) Garbage Collection
>>                 (Metadata GC Threshold) 84M(1%)->56M(0%)
>>
>>                 > Task :sirix-core:test
>>                 [0.001s][warning][pagesize] UseLargePages disabled, no
>>                 large pages configured and available on the system.
>>                 [1.253s][info   ][gc      ] Using The Z Garbage Collector
>>
>>                 [2,930s][info   ][gc      ] GC(1) Garbage Collection
>>                 (Warmup) 1616M(11%)->746M(5%)
>>                 [4,445s][info   ][gc      ] GC(2) Garbage Collection
>>                 (Warmup) 3232M(21%)->750M(5%)
>>                 [5,751s][info   ][gc      ] GC(3) Garbage Collection
>>                 (Warmup) 4644M(30%)->1356M(9%)
>>                 [9,886s][info   ][gc      ] GC(4) Garbage Collection
>>                 (Allocation Rate) 10668M(69%)->612M(4%)
>>                 [10,406s][info   ][gc      ] GC(5) Garbage Collection
>>                 (Allocation Rate) 2648M(17%)->216M(1%)
>>                 [13,931s][info   ][gc      ] GC(6) Garbage Collection
>>                 (Allocation Rate) 11164M(73%)->1562M(10%)
>>                 [16,908s][info   ][gc      ] GC(7) Garbage Collection
>>                 (Allocation Rate) 11750M(76%)->460M(3%)
>>                 [20,690s][info   ][gc      ] GC(8) Garbage Collection
>>                 (Allocation Rate) 12670M(82%)->726M(5%)
>>                 [24,376s][info   ][gc      ] GC(9) Garbage Collection
>>                 (Allocation Rate) 13422M(87%)->224M(1%)
>>                 [28,152s][info   ][gc      ] GC(10) Garbage Collection
>>                 (Proactive) 13474M(88%)->650M(4%)
>>                 [31,526s][info   ][gc      ] GC(11) Garbage Collection
>>                 (Allocation Rate) 12072M(79%)->1472M(10%)
>>                 [34,754s][info   ][gc      ] GC(12) Garbage Collection
>>                 (Allocation Rate) 13050M(85%)->330M(2%)
>>                 [38,478s][info   ][gc      ] GC(13) Garbage Collection
>>                 (Allocation Rate) 13288M(87%)->762M(5%)
>>                 [41,936s][info   ][gc      ] GC(14) Garbage Collection
>>                 (Proactive) 13294M(87%)->504M(3%)
>>                 [45,353s][info   ][gc      ] GC(15) Garbage Collection
>>                 (Allocation Rate) 12984M(85%)->268M(2%)
>>                 [48,861s][info   ][gc      ] GC(16) Garbage Collection
>>                 (Allocation Rate) 13008M(85%)->306M(2%)
>>                 [52,133s][info   ][gc      ] GC(17) Garbage Collection
>>                 (Proactive) 12042M(78%)->538M(4%)
>>                 [55,705s][info   ][gc      ] GC(18) Garbage Collection
>>                 (Allocation Rate) 12420M(81%)->1842M(12%)
>>                 [59,000s][info   ][gc      ] GC(19) Garbage Collection
>>                 (Allocation Rate) 12458M(81%)->1422M(9%)
>>                 [64,501s][info   ][gc      ] Allocation Stall (Test
>>                 worker) 59,673ms
>>                 [64,742s][info   ][gc      ] Allocation Stall (Test
>>                 worker) 240,077ms
>>                 [65,806s][info   ][gc      ] GC(20) Garbage Collection
>>                 (Allocation Rate) 13808M(90%)->6936M(45%)
>>                 [66,476s][info   ][gc      ] GC(21) Garbage Collection
>>                 (Allocation Stall) 7100M(46%)->4478M(29%)
>>                 [69,471s][info   ][gc      ] GC(22) Garbage Collection
>>                 (Allocation Rate) 10098M(66%)->5888M(38%)
>>                 [72,252s][info   ][gc      ] GC(23) Garbage Collection
>>                 (Allocation Rate) 11226M(73%)->5816M(38%)
>>
>>                 ...
>>
>>                 So even here I can see some allocation stalls.
>>
>>                 Running the same with -XX:+ZGenerational in
>>                 build.gradle (probably using GraalVM) does something
>>                 different, but I don't know what... at least off-heap
>>                 memory is exhausted at some point due to direct byte
>>                 buffer usage!?
>>
>>                 So, I'm not sure what's the difference, though.
>>
>>                 With this:
>>
>>                 "-XX:+UseZGC",
>>                 "-Xlog:gc*=debug:file=zgc-generational-detailed.log",
>>                 "-XX:+ZGenerational",
>>                 "-verbose:gc",
>>                 "-XX:+HeapDumpOnOutOfMemoryError",
>>                 "-XX:HeapDumpPath=heapdump.hprof",
>>                 "-XX:MaxDirectMemorySize=2g",
>>
>>
>>                 Caused by: java.lang.OutOfMemoryError: Cannot reserve 60000 bytes of direct buffer memory (allocated: 2147446560, limit: 2147483648)
>>                 	at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
>>                 	at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111)
>>                 	at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:360)
>>                 	at net.openhft.chronicle.bytes.internal.NativeBytesStore.elasticByteBuffer(NativeBytesStore.java:191)
>>                 	at net.openhft.chronicle.bytes.BytesStore.elasticByteBuffer(BytesStore.java:192)
>>                 	at net.openhft.chronicle.bytes.Bytes.elasticByteBuffer(Bytes.java:176)
>>                 	at net.openhft.chronicle.bytes.Bytes.elasticByteBuffer(Bytes.java:148)
>>                 	at io.sirix.access.trx.page.NodePageTrx.lambda$parallelSerializationOfKeyValuePages$1(NodePageTrx.java:443)
>>
>>
>>
>>                 On Thu, Feb 15, 2024 at 12:05 PM Stefan
>>                 Karlsson <stefan.karlsson at oracle.com
>>                 <mailto:stefan.karlsson at oracle.com>> wrote:
>>
>>                     Hi Johannes,
>>
>>                     We tried to look at the log files and the JFR
>>                     files, but couldn't find
>>                     an OutOfMemoryError in any of them. Do you think
>>                     you could try to rerun
>>                     and capture the entire GC log from the
>>                     OutOfMemoryError run?
>>
>>                     A few things to note:
>>
>>                     1) You seem to be running the Graal compiler.
>>                     Graal doesn't support
>>                     Generational ZGC, so you are going to run
>>                     different compilers when you
>>                     compare Singlegen ZGC with Generational ZGC.
>>
>>                     2) It's not clear to me that the provided JFR
>>                     files matches the provided
>>                     log files.
>>
>>                     3) The JFR files show that -XX:+UseLargePages is
>>                     used, but the gc+init
>>                     logs show 'Large Page Support: Disabled'; you
>>                     might want to look into
>>                     why that is the case.
>>
>>                     4) The singlegen JFR file has a
>>                     -Xlog:gc:g1-chicago.log line. It should
>>                     probably be named zgc-chicago.log.
>>
>>                     Cheers,
>>                     StefanK
>>
>>                     On 2024-02-14 17:36, Johannes Lichtenberger wrote:
>>                     > Hello,
>>                     >
>>                     > a test of my little DB project fails using
>>                     generational ZGC, but not
>>                     > with ZGC and G1 (out of memory error).
>>                     >
>>                     > To be honest, I guess the allocation rate and
>>                     thus GC pressure, when
>>                     > traversing a resource in SirixDB is
>>                     unacceptable. The strategy is to
>>                     > create fine-grained nodes from JSON input and
>>                     store these in a trie.
>>                     > First, a 3,8Gb JSON file is shredded and
>>                     imported. Next, a preorder
>>                     > traversal of the generated trie traverses a trie
>>                     (with leaf pages
>>                     > storing 1024 nodes each and in total
>>                     ~300_000_000 (and these are going
>>                     > to be deserialized one by one). The pages are
>>                     furthermore referenced
>>                     > in memory through PageReference::setPage.
>>                     Furthermore, a Caffeine page
>>                     > cache caches the PageReferences (keys) and the
>>                     pages (values) and sets
>>                     > the reference back to null once entries are
>>                     going to be evicted
>>                     > (PageReference.setPage(null)).
>>                     >
>>                     > However, I think the whole strategy of having to
>>                     have in-memory nodes
>>                     > might not be best. Maybe it's better to use
>>                     off-heap memory for the
>>                     > pages itself with MemorySegments, but the pages
>>                     are not of a fixed
>>                     > size, thus it may get tricky.
>>                     >
>>                     > The test mentioned is this:
>>                     >
>>                     https://github.com/sirixdb/sirix/blob/248ab141632c94c6484a3069a056550516afb1d2/bundles/sirix-core/src/test/java/io/sirix/service/json/shredder/JsonShredderTest.java#L69 <https://github.com/sirixdb/sirix/blob/248ab141632c94c6484a3069a056550516afb1d2/bundles/sirix-core/src/test/java/io/sirix/service/json/shredder/JsonShredderTest.java#L69>
>>                     >
>>                     > I can upload the JSON file somewhere for a
>>                     couple of days if needed.
>>                     >
>>                     > Caused by: java.lang.OutOfMemoryError
>>                     >     at
>>                     >
>>                     java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
>>                     >     at
>>                     >
>>                     java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
>>                     >     at
>>                     >
>>                     java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
>>                     >     at
>>                     >
>>                     java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
>>                     >     at
>>                     >
>>                     java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
>>                     >     at
>>                     >
>>                     java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
>>                     >     at
>>                     >
>>                     java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
>>                     >     at
>>                     >
>>                     java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>>                     >     at
>>                     >
>>                     java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>>                     >     at
>>                     >
>>                     java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
>>                     >     at
>>                     > io.sirix.access.trx.page
>>                     <http://io.sirix.access.trx.page>.NodePageTrx.parallelSerializationOfKeyValuePages(NodePageTrx.java:442)
>>                     >
>>                     > I've uploaded several JFR recordings and logs
>>                     over here (maybe besides
>>                     > the async profiler JFR files the zgc-detailed
>>                     log is most interesting):
>>                     >
>>                     >
>>                     https://github.com/sirixdb/sirix/tree/main/bundles/sirix-core <https://github.com/sirixdb/sirix/tree/main/bundles/sirix-core>
>>                     >
>>                     > kind regards
>>                     > Johannes
>>


More information about the zgc-dev mailing list