Generational ZGC issue
Stefan Johansson
stefan.johansson at oracle.com
Fri Feb 16 12:43:10 UTC 2024
Hi Johannes,
We've spent some more time looking at this, and getting the JSON file to
reproduce the issue made it easy to verify our suspicions. Thanks for
uploading it.
There are a few things playing together here. The test makes quite
heavy use of DirectByteBuffers, and you limit the usage to 2G
(-XX:MaxDirectMemorySize=2g). The life cycle and freeing of the native
memory part of a DirectByteBuffer rely on reference processing. In
generational ZGC, reference processing is only done during major
collections, and since the general GC pressure in this benchmark is very
low (most objects die young), we do not trigger that many major
collections.
Normally this would not be a problem. To avoid throwing an
OutOfMemoryError (due to hitting the direct buffer memory limit) too
early, the JDK triggers a System.gc(). This should trigger reference
processing, and all buffers that are no longer in use would be freed.
Since you specify the option -XX:+DisableExplicitGC, all these calls to
trigger GCs are ignored and no direct memory is freed. So in our
testing, just removing this flag makes the test pass.
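This rescue path can be sketched in a small stand-alone program (a
hypothetical demo, not SirixDB code): when a direct allocation would
exceed -XX:MaxDirectMemorySize, the JDK's internal Bits.reserveMemory()
calls System.gc() before giving up, so a loop like this survives under a
tight limit unless -XX:+DisableExplicitGC turns that call into a no-op.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-alone demo of the direct-memory rescue path.
// java -XX:MaxDirectMemorySize=64m DirectLimitDemo            -> completes
// java -XX:MaxDirectMemorySize=64m -XX:+DisableExplicitGC DirectLimitDemo
//   -> typically fails with "Cannot reserve ... bytes of direct buffer
//      memory", because the JDK can no longer trigger the System.gc()
//      whose reference processing frees the unreachable buffers.
public class DirectLimitDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 128; i++) {
            ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024 * 1024); // 8 MB
            buf.put(0, (byte) 1); // touch the memory so the allocation is real
            // buf becomes unreachable here; its native memory is only freed
            // once reference processing runs its Cleaner.
        }
        System.out.println("allocated 1 GB of direct memory in 8 MB chunks");
    }
}
```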
Another solution is to use HeapByteBuffers instead, so you don't have
to worry about direct memory usage at all. The OpenHFT lib seems to
support this: just use elasticHeapByteBuffer(...) instead of
elasticByteBuffer().
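For illustration, here is the same distinction in plain java.nio terms
(the elasticHeapByteBuffer name above is from the OpenHFT API;
everything in this sketch is just the JDK):

```java
import java.nio.ByteBuffer;

public class BufferKinds {
    public static void main(String[] args) {
        // Heap buffer: backed by a byte[] on the Java heap, reclaimed like
        // any other object, with no dependency on reference processing.
        ByteBuffer heap = ByteBuffer.allocate(1024);

        // Direct buffer: native memory counted against -XX:MaxDirectMemorySize;
        // the native part is only freed after its Cleaner runs.
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);

        System.out.println("heap.isDirect()   = " + heap.isDirect());
        System.out.println("direct.isDirect() = " + direct.isDirect());
    }
}
```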
Lastly, the reason this works with non-generational ZGC is that it does
reference processing on every GC.
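The mechanism involved can be sketched with java.lang.ref.Cleaner,
which is what DirectByteBuffer uses internally to free its native
memory (a stand-alone illustration, not SirixDB code): the cleanup
action only runs after the GC's reference processing has discovered
that the registered object is unreachable.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class CleanerDemo {
    private static final Cleaner CLEANER = Cleaner.create();

    public static void main(String[] args) throws Exception {
        CountDownLatch freed = new CountDownLatch(1);
        Object resource = new Object();
        // Like DirectByteBuffer, register a cleanup action that only runs
        // after reference processing notices the object is unreachable.
        CLEANER.register(resource, freed::countDown);
        resource = null;  // drop the last strong reference
        System.gc();      // a hint -- and a no-op with -XX:+DisableExplicitGC
        boolean ran = freed.await(5, TimeUnit.SECONDS);
        System.out.println("cleanup ran: " + ran);
    }
}
```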
Hope this helps,
StefanJ
On 2024-02-15 21:53, Johannes Lichtenberger wrote:
> It's a laptop, I've attached some details.
>
> Furthermore, if it seems worth digging deeper into the issue, the JSON
> file is here for one week:
> https://www.transfernow.net/dl/20240215j9NaPTc0
>
> You'd have to unzip into bundles/sirix-core/src/test/resources/json,
> remove the @Disabled annotation and run the test
> JsonShredderTest::testChicagoDescendantAxis
>
> The test JVM parameters are specified in the parent build.gradle in the
> project root folder.
>
> The GitHub repo: https://github.com/sirixdb/sirix
>
> Screenshot from 2024-02-15 21-43-33.png
>
> kind regards
> Johannes
>
> On Thu, Feb 15, 2024 at 20:01, Peter Booth
> <peter_booth at me.com> wrote:
>
> Just curious - what CPU, physical memory and OS are you using?
> Sent from my iPhone
>
>> On Feb 15, 2024, at 12:23 PM, Johannes Lichtenberger
>> <lichtenberger.johannes at gmail.com> wrote:
>>
>>
>> I guess I don't know which JDK it picks for the tests, but I guess
>> OpenJDK
>>
>> On Thu, Feb 15, 2024 at 17:58, Johannes Lichtenberger
>> <lichtenberger.johannes at gmail.com> wrote:
>>
>> However, it's the same with: ./gradlew
>> -Dorg.gradle.java.home=/home/johannes/.jdks/openjdk-21.0.2
>> :sirix-core:test --tests
>> io.sirix.service.json.shredder.JsonShredderTest.testChicagoDescendantAxis using OpenJDK hopefully
>>
>> On Thu, Feb 15, 2024 at 17:54, Johannes Lichtenberger
>> <lichtenberger.johannes at gmail.com> wrote:
>>
>> I've attached two logs, the first one without
>> -XX:+ZGenerational, the second one with the option set,
>> even though I also saw that generational ZGC is going to
>> be supported in GraalVM 24.1 in September... so not sure
>> what this does :)
>>
>> On Thu, Feb 15, 2024 at 17:52, Johannes Lichtenberger
>> <lichtenberger.johannes at gmail.com> wrote:
>>
>> Strange, so does it simply ignore the option? The
>> following is the beginning of the output from _non_
>> generational ZGC:
>>
>> johannes at luna:~/IdeaProjects/sirix$ ./gradlew
>> -Dorg.gradle.java.home=/home/johannes/.sdkman/candidates/java/21.0.2-graal :sirix-core:test --tests io.sirix.service.json.shredder.JsonShredderTest.testChicagoDescendantAxis
>>
>> > Configure project :
>> The 'sonarqube' task depends on compile tasks. This
>> behavior is now deprecated and will be removed in
>> version 5.x. To avoid implicit compilation, set
>> property 'sonar.gradle.skipCompile' to 'true' and make
>> sure your project is compiled, before analysis has
>> started.
>> The 'sonar' task depends on compile tasks. This
>> behavior is now deprecated and will be removed in
>> version 5.x. To avoid implicit compilation, set
>> property 'sonar.gradle.skipCompile' to 'true' and make
>> sure your project is compiled, before analysis has
>> started.
>> [1,627s][info ][gc ] GC(0) Garbage Collection
>> (Metadata GC Threshold) 84M(1%)->56M(0%)
>>
>> > Task :sirix-core:test
>> [0.001s][warning][pagesize] UseLargePages disabled, no
>> large pages configured and available on the system.
>> [1.253s][info ][gc ] Using The Z Garbage Collector
>>
>> [2,930s][info ][gc ] GC(1) Garbage Collection
>> (Warmup) 1616M(11%)->746M(5%)
>> [4,445s][info ][gc ] GC(2) Garbage Collection
>> (Warmup) 3232M(21%)->750M(5%)
>> [5,751s][info ][gc ] GC(3) Garbage Collection
>> (Warmup) 4644M(30%)->1356M(9%)
>> [9,886s][info ][gc ] GC(4) Garbage Collection
>> (Allocation Rate) 10668M(69%)->612M(4%)
>> [10,406s][info ][gc ] GC(5) Garbage Collection
>> (Allocation Rate) 2648M(17%)->216M(1%)
>> [13,931s][info ][gc ] GC(6) Garbage Collection
>> (Allocation Rate) 11164M(73%)->1562M(10%)
>> [16,908s][info ][gc ] GC(7) Garbage Collection
>> (Allocation Rate) 11750M(76%)->460M(3%)
>> [20,690s][info ][gc ] GC(8) Garbage Collection
>> (Allocation Rate) 12670M(82%)->726M(5%)
>> [24,376s][info ][gc ] GC(9) Garbage Collection
>> (Allocation Rate) 13422M(87%)->224M(1%)
>> [28,152s][info ][gc ] GC(10) Garbage Collection
>> (Proactive) 13474M(88%)->650M(4%)
>> [31,526s][info ][gc ] GC(11) Garbage Collection
>> (Allocation Rate) 12072M(79%)->1472M(10%)
>> [34,754s][info ][gc ] GC(12) Garbage Collection
>> (Allocation Rate) 13050M(85%)->330M(2%)
>> [38,478s][info ][gc ] GC(13) Garbage Collection
>> (Allocation Rate) 13288M(87%)->762M(5%)
>> [41,936s][info ][gc ] GC(14) Garbage Collection
>> (Proactive) 13294M(87%)->504M(3%)
>> [45,353s][info ][gc ] GC(15) Garbage Collection
>> (Allocation Rate) 12984M(85%)->268M(2%)
>> [48,861s][info ][gc ] GC(16) Garbage Collection
>> (Allocation Rate) 13008M(85%)->306M(2%)
>> [52,133s][info ][gc ] GC(17) Garbage Collection
>> (Proactive) 12042M(78%)->538M(4%)
>> [55,705s][info ][gc ] GC(18) Garbage Collection
>> (Allocation Rate) 12420M(81%)->1842M(12%)
>> [59,000s][info ][gc ] GC(19) Garbage Collection
>> (Allocation Rate) 12458M(81%)->1422M(9%)
>> [64,501s][info ][gc ] Allocation Stall (Test
>> worker) 59,673ms
>> [64,742s][info ][gc ] Allocation Stall (Test
>> worker) 240,077ms
>> [65,806s][info ][gc ] GC(20) Garbage Collection
>> (Allocation Rate) 13808M(90%)->6936M(45%)
>> [66,476s][info ][gc ] GC(21) Garbage Collection
>> (Allocation Stall) 7100M(46%)->4478M(29%)
>> [69,471s][info ][gc ] GC(22) Garbage Collection
>> (Allocation Rate) 10098M(66%)->5888M(38%)
>> [72,252s][info ][gc ] GC(23) Garbage Collection
>> (Allocation Rate) 11226M(73%)->5816M(38%)
>>
>> ...
>>
>> So even here I can see some allocation stalls.
>>
>> Running the same with -XX:+ZGenerational in
>> build.gradle, probably using GraalVM, does something
>> different, but I don't know what... at least off-heap
>> memory is exhausted at some point due to direct byte
>> buffer usage!?
>>
>> So I'm not sure what the difference is, though.
>>
>> With this:
>>
>> "-XX:+UseZGC",
>> "-Xlog:gc*=debug:file=zgc-generational-detailed.log",
>> "-XX:+ZGenerational",
>> "-verbose:gc",
>> "-XX:+HeapDumpOnOutOfMemoryError",
>> "-XX:HeapDumpPath=heapdump.hprof",
>> "-XX:MaxDirectMemorySize=2g",
>>
>>
>> Caused by: java.lang.OutOfMemoryError: Cannot reserve 60000 bytes of direct buffer memory (allocated: 2147446560, limit: 2147483648)
>> at java.base/java.nio.Bits.reserveMemory(Bits.java:178)
>> at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111)
>> at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:360)
>> at net.openhft.chronicle.bytes.internal.NativeBytesStore.elasticByteBuffer(NativeBytesStore.java:191)
>> at net.openhft.chronicle.bytes.BytesStore.elasticByteBuffer(BytesStore.java:192)
>> at net.openhft.chronicle.bytes.Bytes.elasticByteBuffer(Bytes.java:176)
>> at net.openhft.chronicle.bytes.Bytes.elasticByteBuffer(Bytes.java:148)
>> at io.sirix.access.trx.page.NodePageTrx.lambda$parallelSerializationOfKeyValuePages$1(NodePageTrx.java:443)
>>
>>
>>
>> On Thu, Feb 15, 2024 at 12:05, Stefan Karlsson
>> <stefan.karlsson at oracle.com> wrote:
>>
>> Hi Johannes,
>>
>> We tried to look at the log files and the JFR
>> files, but couldn't find
>> an OutOfMemoryError in any of them. Do you think
>> you could try to rerun
>> and capture the entire GC log from the
>> OutOfMemoryError run?
>>
>> A few things to note:
>>
>> 1) You seem to be running the Graal compiler.
>> Graal doesn't support
>> Generational ZGC, so you are going to run
>> different compilers when you
>> compare Singlegen ZGC with Generational ZGC.
>>
>> 2) It's not clear to me that the provided JFR
>> files match the provided
>> log files.
>>
>> 3) The JFR files show that -XX:+UseLargePages is
>> used, but the gc+init
>> logs show 'Large Page Support: Disabled'; you
>> might want to look into
>> why that is the case.
>>
>> 4) The singlegen JFR file has a
>> -Xlog:gc:g1-chicago.log line. It should
>> probably be named zgc-chicago.log.
>>
>> Cheers,
>> StefanK
>>
>> On 2024-02-14 17:36, Johannes Lichtenberger wrote:
>> > Hello,
>> >
>> > a test of my little DB project fails with an
>> > OutOfMemoryError using generational ZGC, but not
>> > with single-generation ZGC or G1.
>> >
>> > To be honest, I guess the allocation rate, and
>> > thus GC pressure, when traversing a resource in
>> > SirixDB is unacceptable. The strategy is to
>> > create fine-grained nodes from JSON input and
>> > store these in a trie. First, a 3.8 GB JSON file
>> > is shredded and imported. Next, a preorder
>> > traversal visits the generated trie (leaf pages
>> > store 1024 nodes each, ~300_000_000 nodes in
>> > total, and these are deserialized one by one).
>> > The pages are furthermore referenced in memory
>> > through PageReference::setPage. Furthermore, a
>> > Caffeine page cache caches the PageReferences
>> > (keys) and the pages (values) and sets the
>> > reference back to null once entries are evicted
>> > (PageReference.setPage(null)).
>> >
>> > However, I think the whole strategy of keeping
>> > nodes in memory might not be best. Maybe it's
>> > better to use off-heap memory for the pages
>> > themselves with MemorySegments, but the pages
>> > are not of a fixed size, so it may get tricky.
>> >
>> > The test mentioned is this:
>> >
>> https://github.com/sirixdb/sirix/blob/248ab141632c94c6484a3069a056550516afb1d2/bundles/sirix-core/src/test/java/io/sirix/service/json/shredder/JsonShredderTest.java#L69
>> >
>> > I can upload the JSON file somewhere for a
>> couple of days if needed.
>> >
>> > Caused by: java.lang.OutOfMemoryError
>> > at
>> >
>> java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
>> > at
>> >
>> java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
>> > at
>> >
>> java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
>> > at
>> >
>> java.base/java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:542)
>> > at
>> >
>> java.base/java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:567)
>> > at
>> >
>> java.base/java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:670)
>> > at
>> >
>> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160)
>> > at
>> >
>> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>> > at
>> >
>> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
>> > at
>> >
>> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
>> > at
>> > io.sirix.access.trx.page.NodePageTrx.parallelSerializationOfKeyValuePages(NodePageTrx.java:442)
>> >
>> > I've uploaded several JFR recordings and logs
>> over here (maybe besides
>> > the async profiler JFR files the zgc-detailed
>> log is most interesting):
>> >
>> >
>> https://github.com/sirixdb/sirix/tree/main/bundles/sirix-core
>> >
>> > kind regards
>> > Johannes
>>
More information about the zgc-dev
mailing list