From David.Tavoularis at mycom-osi.com Mon Jul 5 09:45:27 2021 From: David.Tavoularis at mycom-osi.com (David Tavoularis) Date: Mon, 05 Jul 2021 11:45:27 +0200 Subject: jmap histo:live different behavior with ZGC Message-ID: Hi, When using "jmap -histo:live <pid>", the JVM triggers a Garbage Collection when using G1 (the default GC in Java 16), but not when using ZGC. Is this expected behavior or a bug? Please note that "jcmd GC.run" correctly triggers a Garbage Collection when using ZGC. Is there a way to measure only live objects with ZGC? Additional information from the jmap usage text about -histo[:live]: Prints a histogram of the heap. For each Java class, the number of objects, memory size in bytes, and fully qualified class names are printed. VM internal class names are printed with a '*' prefix. If the live suboption is specified, only live objects are counted. Best Regards -- David From erik.osterlund at oracle.com Mon Jul 5 18:09:26 2021 From: erik.osterlund at oracle.com (Erik Osterlund) Date: Mon, 5 Jul 2021 18:09:26 +0000 Subject: jmap histo:live different behavior with ZGC In-Reply-To: References: Message-ID: Hi David, The ZGC heap walker used to collect the stats performs a transitive traversal from the roots, as opposed to heap parsing like the STW collectors. Doing an STW full GC and then parsing the heap yields the same set of objects that heap walking by traversal does (ish). I suppose there is a slight difference in that the traversal follows through non-strong references, while a full GC might clear them before parsing. Arguably the traversal strategy is more accurate, as weakly reachable objects are still "live" at the point the traversal happens. So other than a GC not showing up in the log, is there any actual observable difference in behaviour? Thanks, /Erik > On 5 Jul 2021, at 14:25, David Tavoularis wrote: > > Hi, > > When using "jmap -histo:live <pid>", the JVM triggers a Garbage Collection when using G1 (the default GC in Java 16), but not when using ZGC. 
Is this expected behavior or a bug? > Please note that "jcmd GC.run" correctly triggers a Garbage Collection when using ZGC. > Is there a way to measure only live objects with ZGC? > > Additional information from the jmap usage text about -histo[:live]: Prints a histogram of the heap. For each Java class, the number of objects, memory size in bytes, and fully qualified class names are printed. VM internal class names are printed with a '*' prefix. If the live suboption is specified, only live objects are counted. > > Best Regards > -- > David From David.Tavoularis at mycom-osi.com Tue Jul 6 15:46:24 2021 From: David.Tavoularis at mycom-osi.com (David Tavoularis) Date: Tue, 06 Jul 2021 17:46:24 +0200 Subject: jmap histo:live different behavior with ZGC In-Reply-To: References: Message-ID: Hi Erik, > So other than a GC not showing up in the log, is there any actual > observable difference in behaviour? None. Your explanation makes perfect sense. What I am trying to achieve in the lab is to measure the maximum retained memory by taking regular "jmap -histo:live" histograms during my workload. I wanted to make sure that garbage-collectable objects were not counted by this command when using ZGC. Best Regards David On Mon, 05 Jul 2021 20:09:26 +0200, Erik Osterlund wrote: > Hi David, > > The ZGC heap walker used to collect the stats performs a transitive traversal > from the roots, as opposed to heap parsing like the STW collectors. Doing an > STW full GC and then parsing the heap yields the same set of objects > that heap walking by traversal does (ish). I suppose there is a slight > difference in that the traversal follows through non-strong references, > while a full GC might clear them before parsing. Arguably the traversal > strategy is more accurate, as weakly reachable objects are still "live" > at the point the traversal happens. > > So other than a GC not showing up in the log, is there any actual > observable difference in behaviour? 
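[Editor's note] What David describes -- forcing a collection and then measuring what survives -- can also be approximated in-process. The sketch below is illustrative only (the class and method names are made up): it uses the standard java.lang.management API, where MemoryMXBean.gc() behaves like System.gc(), to request a collection before sampling used heap, mirroring the "jcmd GC.run" workaround from this thread for collectors where "jmap -histo:live" does not itself trigger a GC.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Illustrative sketch (hypothetical class name): an in-process analogue of
// running "jcmd <pid> GC.run" before taking a histogram -- request a
// collection first, then sample used heap as an approximation of retained
// (live) memory.
public class RetainedHeapProbe {

    static long approxLiveHeapBytes() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        // Behaves like System.gc(); honored unless -XX:+DisableExplicitGC is set.
        mem.gc();
        return mem.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        System.out.println("approx live heap bytes: " + approxLiveHeapBytes());
    }
}
```

Sampling this periodically during a workload gives a rough upper bound on the maximum retained memory, independent of which collector is in use.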
> > Thanks, > /Erik > >> On 5 Jul 2021, at 14:25, David Tavoularis >> wrote: >> >> Hi, >> >> When using "jmap -histo:live <pid>", the JVM triggers a Garbage >> Collection when using G1 (the default GC in Java 16), but not when using ZGC. Is >> this expected behavior or a bug? >> Please note that "jcmd GC.run" correctly triggers a Garbage >> Collection when using ZGC. >> Is there a way to measure only live objects with ZGC? >> >> Additional information from the jmap usage text about -histo[:live]: Prints a >> histogram of the heap. For each Java class, the number of objects, memory >> size in bytes, and fully qualified class names are printed. VM internal >> class names are printed with a '*' prefix. If the live suboption is >> specified, only live objects are counted. >> >> Best Regards >> -- >> David From roy.sunny.zhang007 at gmail.com Wed Jul 7 12:12:16 2021 From: roy.sunny.zhang007 at gmail.com (Roy Zhang) Date: Wed, 7 Jul 2021 20:12:16 +0800 Subject: JDK11 + ZGC JVM Crash Message-ID: Dear ZGC Experts, In our 32G heap Kafka server, our JVM has crashed twice; is this due to a known ZGC issue? Thanks in advance! JDK version: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 Excerpt of the JVM crash log: # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007ffa5b7ea200, pid=4649, tid=4951 # # JRE version: OpenJDK Runtime Environment Corretto-11.0.10.9.1 (11.0.10+9) (build 11.0.10+9-LTS) # Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (11.0.10+9-LTS, mixed mode, tiered, z gc, linux-amd64) # Problematic frame: # V [libjvm.so+0xf14200] ZPage::relocate_object_inner(unsigned long, unsigned long)+0x90 # # No core dump will be written. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please visit: # https://github.com/corretto/corretto-11/issues/ # --------------- S U M M A R Y ------------ Command Line: -Xmx32G -Xms32G -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xlog:gc*:file=/server/kafka-2.1.1/bin/../logs/kafkaServer-gc.log:time,tags:filecount=10,filesize=102400 -Dcom.sun.management.jmxremote=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.net.preferIPv4Stack=tr -Dkafka.logs.dir=/server/kafka-2.1.1/bin/../logs -Dlog4j.configuration=file:/server/kafka-2.1.1/bin/../config/log4j.properties -javaagent:/server/jmx_exporter/jmx_prometheus_javaagent-0.3.1.jar=9900:/server/jmx_exporter/kafka.yml kafka.Kafka /server/kafka-2.1.1/config/server.properties Host: Intel(R) Xeon(R) Platinum 8252C CPU @ 3.80GHz, 48 cores, 184G, Amazon Linux release 2 (Karoo) Time: Fri Jun 18 03:21:39 2021 UTC elapsed time: 9338153.258651 seconds (108d 1h 55m 53s) --------------- T H R E A D --------------- Current thread (0x00007ffa54074ff0): GCTaskThread "ZWorker#23" [stack: 0x00007ffa3c084000,0x00007ffa3c184000] [id=4951] Stack: [0x00007ffa3c084000,0x00007ffa3c184000], sp=0x00007ffa3c182bc0, free space=1018k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xf14200] ZPage::relocate_object_inner(unsigned long, unsigned long)+0x90 V [libjvm.so+0xf1465e] ZPage::relocate_object(unsigned long)+0x3e V [libjvm.so+0xf04ac7] ZHeap::relocate_object(unsigned long)+0x77 V [libjvm.so+0xef0c5a] ZRelocateRootOopClosure::do_oop(oopDesc**)+0x6a V [libjvm.so+0xf0f066] ZNMethodTable::oops_do(OopClosure*)+0x116 V [libjvm.so+0xf1c97a] ZRootsIterator::oops_do(OopClosure*, bool)+0x43a V [libjvm.so+0xf1aa15] ZRelocateRootsTask::work()+0x25 V [libjvm.so+0xf2679c] ZTask::GangTask::work(unsigned int)+0x1c V [libjvm.so+0xee7b83] GangWorker::loop()+0x43 V 
[libjvm.so+0xe502dd] Thread::call_run()+0x14d V [libjvm.so+0xc0469c] thread_native_entry(Thread*)+0xec Thanks, Roy From lsc1943 at gmail.com Fri Jul 9 04:11:33 2021 From: lsc1943 at gmail.com (Jack Ling) Date: Fri, 9 Jul 2021 12:11:33 +0800 Subject: What's the Difference in Processing Non-strong References Between ZGC and G1 Message-ID: Dear ZGC experts, Recently we compared ZGC and G1 on JDK 11 for our application and found one big difference in reference processing between the two GC types. We captured JFR (Java Flight Recorder) recordings during the A/B test, and observed from the JDK reference statistics that ZGC processes many more non-strong references than G1 in each GC cycle (thousands of weak references and hundreds of soft references were processed by ZGC, but only a few by G1). Is there any difference in how ZGC and G1 handle those non-strong references? Another question: in the GC logs, the heap size after GC using ZGC was much higher than with G1, which did not seem reasonable to me. As far as I know, ZGC does not have an old (survivor) space, so it should collect more garbage, and the heap size after GC should be lower than with G1. Is there an explanation for why we observed the opposite in the GC logs below? (The GC logs were captured under exactly the same workload for each application instance.) ZGC: GC(1721) Garbage Collection (Allocation Rate) 2066M(42%)->834M(17%) G1: GC(864) Pause Young (Normal) (G1 Evacuation Pause) 3286M->342M(4916M) JVM parameters: ZGC, -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -XX:ZAllocationSpikeTolerance=5 (We set this parameter for more aggressive GC operations, to reserve memory for the upcoming peak workload in production; not sure if it's recommended) G1, -XX:+UseG1GC -XX:MaxGCPauseMillis=100 Thank you for the help! Best Regards! 
Jack From roy.sunny.zhang007 at gmail.com Fri Jul 9 11:16:55 2021 From: roy.sunny.zhang007 at gmail.com (Roy Zhang) Date: Fri, 9 Jul 2021 19:16:55 +0800 Subject: JDK11 + ZGC JVM Crash In-Reply-To: References: Message-ID: Dear ZGC experts, Is this JVM crash related to https://bugs.openjdk.java.net/browse/JDK-8215487? If so, is there any workaround in JDK 11? Thanks, Roy On Wed, Jul 7, 2021 at 8:12 PM Roy Zhang wrote: > Dear ZGC Experts, > > In our 32G heap Kafka server, our JVM has crashed twice; is this due to a known > ZGC issue? Thanks in advance! > JDK version: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 > > Excerpt of the JVM crash log: > > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGSEGV (0xb) at pc=0x00007ffa5b7ea200, pid=4649, tid=4951 > # > # JRE version: OpenJDK Runtime Environment Corretto-11.0.10.9.1 > (11.0.10+9) (build 11.0.10+9-LTS) > # Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (11.0.10+9-LTS, > mixed mode, tiered, z gc, linux-amd64) > # Problematic frame: > # V [libjvm.so+0xf14200] ZPage::relocate_object_inner(unsigned long, > unsigned long)+0x90 > # > # No core dump will be written. Core dumps have been disabled. 
To enable > core dumping, try "ulimit -c unlimited" before starting Java again > # > # If you would like to submit a bug report, please visit: > # https://github.com/corretto/corretto-11/issues/ > # > > --------------- S U M M A R Y ------------ > > Command Line: -Xmx32G -Xms32G -XX:+UnlockExperimentalVMOptions -XX:+UseZGC > -Xlog:gc*:file=/server/kafka-2.1.1/bin/../logs/kafkaServer-gc.log:time,tags:filecount=10,filesize=102400 > -Dcom.sun.management.jmxremote=false > -Dcom.sun.management.jmxremote.authenticate=false > -Dcom.sun.management.jmxremote.ssl=false -Djava.net.preferIPv4Stack=tr > -Dkafka.logs.dir=/server/kafka-2.1.1/bin/../logs > -Dlog4j.configuration=file:/server/kafka-2.1.1/bin/../config/log4j.properties > -javaagent:/server/jmx_exporter/jmx_prometheus_javaagent-0.3.1.jar=9900:/server/jmx_exporter/kafka.yml > kafka.Kafka /server/kafka-2.1.1/config/server.properties > > Host: Intel(R) Xeon(R) Platinum 8252C CPU @ 3.80GHz, 48 cores, 184G, > Amazon Linux release 2 (Karoo) > Time: Fri Jun 18 03:21:39 2021 UTC elapsed time: 9338153.258651 seconds > (108d 1h 55m 53s) > > --------------- T H R E A D --------------- > > Current thread (0x00007ffa54074ff0): GCTaskThread "ZWorker#23" [stack: > 0x00007ffa3c084000,0x00007ffa3c184000] [id=4951] > > Stack: [0x00007ffa3c084000,0x00007ffa3c184000], sp=0x00007ffa3c182bc0, > free space=1018k > Native frames: (J=compiled Java code, A=aot compiled Java code, > j=interpreted, Vv=VM code, C=native code) > V [libjvm.so+0xf14200] ZPage::relocate_object_inner(unsigned long, > unsigned long)+0x90 > V [libjvm.so+0xf1465e] ZPage::relocate_object(unsigned long)+0x3e > V [libjvm.so+0xf04ac7] ZHeap::relocate_object(unsigned long)+0x77 > V [libjvm.so+0xef0c5a] ZRelocateRootOopClosure::do_oop(oopDesc**)+0x6a > V [libjvm.so+0xf0f066] ZNMethodTable::oops_do(OopClosure*)+0x116 > V [libjvm.so+0xf1c97a] ZRootsIterator::oops_do(OopClosure*, bool)+0x43a > V [libjvm.so+0xf1aa15] ZRelocateRootsTask::work()+0x25 > V [libjvm.so+0xf2679c] 
ZTask::GangTask::work(unsigned int)+0x1c > V [libjvm.so+0xee7b83] GangWorker::loop()+0x43 > V [libjvm.so+0xe502dd] Thread::call_run()+0x14d > V [libjvm.so+0xc0469c] thread_native_entry(Thread*)+0xec > > > Thanks, > Roy > From charlie.hunt at oracle.com Fri Jul 9 19:55:06 2021 From: charlie.hunt at oracle.com (Charlie Hunt) Date: Fri, 9 Jul 2021 19:55:06 +0000 Subject: JDK11 + ZGC JVM Crash In-Reply-To: References: Message-ID: Hi Roy, Thanks for the feedback on running ZGC in JDK 11. Just a reminder that ZGC is an experimental feature in JDK 11; it became non-experimental in JDK 15. What this means is that if you are experiencing an issue with ZGC in JDK 11 through JDK 14, where ZGC was an experimental feature, the appropriate resolution is to migrate to either the most recent JDK 15 update release or, ideally, the most recent JDK 16 update release. A very large number of improvements went into ZGC between JDK 11 and JDK 15, and likewise JDK 16. If you experience an issue with ZGC in either JDK 15 or JDK 16, please let us know. thanks, Charlie Hunt ________________________________ From: zgc-dev on behalf of Roy Zhang Sent: Wednesday, July 7, 2021 7:12 AM To: zgc-dev at openjdk.java.net Subject: JDK11 + ZGC JVM Crash Dear ZGC Experts, In our 32G heap Kafka server, our JVM has crashed twice; is this due to a known ZGC issue? Thanks in advance! JDK version: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 Excerpt of the JVM crash log: # # A fatal error has been detected by the Java Runtime Environment: # # SIGSEGV (0xb) at pc=0x00007ffa5b7ea200, pid=4649, tid=4951 # # JRE version: OpenJDK Runtime Environment Corretto-11.0.10.9.1 (11.0.10+9) (build 11.0.10+9-LTS) # Java VM: OpenJDK 64-Bit Server VM Corretto-11.0.10.9.1 (11.0.10+9-LTS, mixed mode, tiered, z gc, linux-amd64) # Problematic frame: # V [libjvm.so+0xf14200] ZPage::relocate_object_inner(unsigned long, unsigned long)+0x90 # # No core dump will be written. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please visit: # https://github.com/corretto/corretto-11/issues/ # --------------- S U M M A R Y ------------ Command Line: -Xmx32G -Xms32G -XX:+UnlockExperimentalVMOptions -XX:+UseZGC -Xlog:gc*:file=/server/kafka-2.1.1/bin/../logs/kafkaServer-gc.log:time,tags:filecount=10,filesize=102400 -Dcom.sun.management.jmxremote=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.net.preferIPv4Stack=tr -Dkafka.logs.dir=/server/kafka-2.1.1/bin/../logs -Dlog4j.configuration=file:/server/kafka-2.1.1/bin/../config/log4j.properties -javaagent:/server/jmx_exporter/jmx_prometheus_javaagent-0.3.1.jar=9900:/server/jmx_exporter/kafka.yml kafka.Kafka /server/kafka-2.1.1/config/server.properties Host: Intel(R) Xeon(R) Platinum 8252C CPU @ 3.80GHz, 48 cores, 184G, Amazon Linux release 2 (Karoo) Time: Fri Jun 18 03:21:39 2021 UTC elapsed time: 9338153.258651 seconds (108d 1h 55m 53s) --------------- T H R E A D --------------- Current thread (0x00007ffa54074ff0): GCTaskThread "ZWorker#23" [stack: 0x00007ffa3c084000,0x00007ffa3c184000] [id=4951] Stack: [0x00007ffa3c084000,0x00007ffa3c184000], sp=0x00007ffa3c182bc0, free space=1018k Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0xf14200] ZPage::relocate_object_inner(unsigned long, unsigned long)+0x90 V [libjvm.so+0xf1465e] ZPage::relocate_object(unsigned long)+0x3e V [libjvm.so+0xf04ac7] ZHeap::relocate_object(unsigned long)+0x77 V [libjvm.so+0xef0c5a] ZRelocateRootOopClosure::do_oop(oopDesc**)+0x6a V [libjvm.so+0xf0f066] ZNMethodTable::oops_do(OopClosure*)+0x116 V [libjvm.so+0xf1c97a] ZRootsIterator::oops_do(OopClosure*, bool)+0x43a V [libjvm.so+0xf1aa15] ZRelocateRootsTask::work()+0x25 V [libjvm.so+0xf2679c] ZTask::GangTask::work(unsigned int)+0x1c V [libjvm.so+0xee7b83] GangWorker::loop()+0x43 V 
[libjvm.so+0xe502dd] Thread::call_run()+0x14d V [libjvm.so+0xc0469c] thread_native_entry(Thread*)+0xec Thanks, Roy From rkennke at redhat.com Tue Jul 20 11:47:08 2021 From: rkennke at redhat.com (Roman Kennke) Date: Tue, 20 Jul 2021 13:47:08 +0200 Subject: Need help with ZGC failure in Lilliput Message-ID: Hi ZGC devs, I am struggling with a ZGC problem in Lilliput, and would like to ask for your opinion. I'm currently working on changing the runtime oopDesc::klass() to load the Klass* from the object header instead of the dedicated Klass* field: https://github.com/openjdk/lilliput/pull/12 This required some coordination in other GCs, because it's not always safe to access the object header. In particular, objects may be locked, at which point we need to find the displaced header or, in the worst case, inflate the header. I believe I've solved that in all GCs. However, I am still getting a failure with ZGC, which is kind of unexpected, because it's the only GC that is *not* messing with object headers (as far as I know). If you check out the above PR, the failure can easily be reproduced with: make run-test TEST=gc/z/TestGarbageCollectorMXBean.java (and only that test is failing for me). The crash is in ZHeap::is_object_live(), because the ZPage there turns out to be NULL. I've added a bunch of debug output in that location, and it looks like the offending object is always inflated *and* forwarded when it happens, but I fail to see how these are related to each other, or to the page being NULL. I strongly suspect that inflation of the object header by calling klass() on it causes the trouble. Changing back to the original implementation of oopDesc::klass() (swap the commented-out code there) makes the bug disappear. Also, the bug always seems to happen when calling through a weak barrier. Not sure if that is relevant. Any ideas? Opinions? 
Thanks, Roman From erik.osterlund at oracle.com Tue Jul 20 14:48:32 2021 From: erik.osterlund at oracle.com (=?UTF-8?Q?Erik_=c3=96sterlund?=) Date: Tue, 20 Jul 2021 16:48:32 +0200 Subject: Need help with ZGC failure in Lilliput In-Reply-To: References: Message-ID: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> Hi Roman, Might need to catch up a bit with what you have done in detail. However, rather than studying your changes in detail, I'm just gonna throw out a few things that I know would blow up when doing something like this, unless you have done something to fix that. The main theme is displaced mark words. Due to issues with displaced mark words, I don't currently know of any (good and performant) way of having any stable bits in the markWord that can be read by mutators, when there are displaced mark words. This is why in the generational version of ZGC we are currently building, we do not encode any age bits in the markWord. Instead it is encoded as a per-region/page property, which is more reliable, and can use much lighter synchronization. Anyway, here comes a few displaced mark word issues: 1) A displaced mark word can point into an ObjectMonitor. However, the monitor can be concurrently deflated, and subsequently freed. The safe memory reclamation policy for ObjectMonitor unlinks the monitors first, then performs a thread-local handshake with all (...?!) threads, and then frees them after the handshake, when it knows that surely nobody is looking at these monitors any longer. Except of course concurrent GC threads do not take part in such handshakes, and therefore, concurrent GC threads are suddenly unable to safely read klasses, through displaced mark words, pointing into concurrently freeing ObjectMonitors. It can result in use-after-free. They are basically not allowed to dereference ObjectMonitor, without more synchronization code to allow that. 
2) A displaced mark word can also point into a stack lock, right into the stack of a concurrently running thread. Naturally, this thread can concurrently die, and its stack be deallocated, or concurrently mutated after the lock is released on that thread. In other words, the memory of the stack on other threads is completely unreliable. The way in which this works regarding hashCode, which similarly needs to be read by various parties, is that the stack lock is concurrently inflated into an inflated lock, which is then a bit more stable to read through, given the right sync dance. Assuming of course, that the reading thread, takes part in the global handshake for SMR purposes. So yeah, not sure if you have thought about any of this. If not, it might be the issue you are chasing after. It's worth mentioning that Robbin Ehn is currently removing displaced mark words with his Java monitors work. That should make this kind of exercise easier. Thanks, /Erik On 2021-07-20 13:47, Roman Kennke wrote: > Hi ZGC devs, > > I am struggling with a ZGC problem in Lilliput, and would like to ask > for your opinion. > > I'm currently working on changing runtime oopDesc::klass() to load the > Klass* from the object header instead of the dedicated Klass* field: > > https://github.com/openjdk/lilliput/pull/12 > > This required some coordination in other GCs, because it's not always > safe to access the object header. In particular, objects may be locked, > at which point we need to find the displaced header, or worst case, > inflate the header. I believe I've solved that in all GCs. > > However, I am still getting a failure with ZGC, which is kinda > unexpected, because it's the only GC that is *not* messing with object > headers (as far as I know. If you check out the above PR, the failure > can easily reproduced with: > > make run-test TEST=gc/z/TestGarbageCollectorMXBean.java > > (and only that test is failing for me). 
> > The crash is in ZHeap::is_object_live(), because the ZPage there turns > out to be NULL. I've added a bunch of debug output in that location, and > it looks like the offending object is always inflated *and* forwarded > when it happens, but I fail to see how these are related to each other, > or to the page being NULL. I strongly suspect that inflation of the > object header by calling klass() on it causes the trouble. Changing > back to the original implementation of oopDesc::klass() (swap > the commented-out code there) makes the bug disappear. > > Also, the bug always seems to happen when calling through a weak > barrier. Not sure if that is relevant. > > Any ideas? Opinions? > > Thanks, > Roman > From rkennke at redhat.com Tue Jul 20 16:21:56 2021 From: rkennke at redhat.com (Roman Kennke) Date: Tue, 20 Jul 2021 18:21:56 +0200 Subject: Need help with ZGC failure in Lilliput In-Reply-To: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> References: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> Message-ID: Hi Erik, yes, I have thought about this, but I am not sure if what I do is enough. I'm basically following the logic that is implemented for hashCode: if we encounter a stack-lock owned by the current thread, we can directly load the displaced header from it; otherwise we inflate the lock. I suspect that this does not play perfectly with GCs, because it means that, e.g., concurrent marking or relocation would inflate some locks. The trouble might indeed be that GC threads would not be allowed to do that because of concurrent deflation. Hrmpf. Is there a way to prevent deflation during certain GC phases, maybe? Or to coordinate GC threads with deflation? Robbin Ehn's work sounds promising. Can you give me more details about what he's up to? Maybe a PR? Thanks, Roman > Hi Roman, > > Might need to catch up a bit with what you have done in detail. 
However, > rather than studying your changes in detail, I'm just gonna throw out a > few things that I know would blow up when doing something like this, > unless you have done something to fix that. > > The main theme is displaced mark words. Due to issues with displaced > mark words, I don't currently know of any (good and performant) way of > having any stable bits in the markWord that can be read by mutators, > when there are displaced mark words. This is why in the generational > version of ZGC we are currently building, we do not encode any age bits > in the markWord. Instead it is encoded as a per-region/page property, > which is more reliable, and can use much lighter synchronization. > Anyway, here comes a few displaced mark word issues: > > 1) A displaced mark word can point into an ObjectMonitor. However, the > monitor can be concurrently deflated, and subsequently freed. The safe > memory reclamation policy for ObjectMonitor unlinks the monitors first, > then performs a thread-local handshake with all (...?!) threads, and > then frees them after the handshake, when it knows that surely nobody is > looking at these monitors any longer. Except of course concurrent GC > threads do not take part in such handshakes, and therefore, concurrent > GC threads are suddenly unable to safely read klasses, through displaced > mark words, pointing into concurrently freeing ObjectMonitors. It can > result in use-after-free. They are basically not allowed to dereference > ObjectMonitor, without more synchronization code to allow that. > > 2) A displaced mark word can also point into a stack lock, right into > the stack of a concurrently running thread. Naturally, this thread can > concurrently die, and its stack be deallocated, or concurrently mutated > after the lock is released on that thread. In other words, the memory of > the stack on other threads is completely unreliable. 
The way in which > this works regarding hashCode, which similarly needs to be read by > various parties, is that the stack lock is concurrently inflated into an > inflated lock, which is then a bit more stable to read through, given > the right sync dance. Assuming of course, that the reading thread, takes > part in the global handshake for SMR purposes. > > So yeah, not sure if you have thought about any of this. If not, it > might be the issue you are chasing after. It's worth mentioning that > Robbin Ehn is currently removing displaced mark words with his Java > monitors work. That should make this kind of exercise easier. > > Thanks, > /Erik > > On 2021-07-20 13:47, Roman Kennke wrote: >> Hi ZGC devs, >> >> I am struggling with a ZGC problem in Lilliput, and would like to ask >> for your opinion. >> >> I'm currently working on changing runtime oopDesc::klass() to load the >> Klass* from the object header instead of the dedicated Klass* field: >> >> https://github.com/openjdk/lilliput/pull/12 >> >> This required some coordination in other GCs, because it's not always >> safe to access the object header. In particular, objects may be >> locked, at which point we need to find the displaced header, or worst >> case, inflate the header. I believe I've solved that in all GCs. >> >> However, I am still getting a failure with ZGC, which is kinda >> unexpected, because it's the only GC that is *not* messing with object >> headers (as far as I know. If you check out the above PR, the failure >> can easily reproduced with: >> >> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java >> >> (and only that test is failing for me). >> >> The crash is in ZHeap::is_object_live() because the ZPage there turns >> out to be NULL. I've added a bunch of debug output in that location, >> and it looks like the offending object is always inflated *and* >> forwarded when it happens, but I fail to see how this is related to >> each other, and to the page being NULL. 
I strongly suspect that >> inflation of the object header by calling klass() on it causes the >> trouble. Changing back to the original implementation of oopDesc::klass() >> (swap the commented-out code there) makes the bug disappear. >> >> Also, the bug always seems to happen when calling through a weak >> barrier. Not sure if that is relevant. >> >> Any ideas? Opinions? >> >> Thanks, >> Roman >> > From rkennke at redhat.com Tue Jul 20 19:08:47 2021 From: rkennke at redhat.com (Roman Kennke) Date: Tue, 20 Jul 2021 21:08:47 +0200 Subject: Need help with ZGC failure in Lilliput In-Reply-To: References: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> Message-ID: Alright, I disabled monitor deflation, and the problem persists. I narrowed the problem down to two (closely related) calls to ZUtils::object_size() (which in turn calls oopDesc::size(), which calls oopDesc::klass()) that seem to trigger it: ZLiveMap::iterate() and ZRelocateClosure::relocate_object(). When I change those two calls to use the Klass* from the dedicated field, the failure disappears. I don't think that the actual Klass* disagrees (I have asserts to check that), which means it can only be a side effect of that call -- most likely the header inflation. How this inflation would affect ZGC is still not entirely clear to me. The problem always happens during the weak storage scan. Inflating a lock creates a new weak handle that points back to the object. I suspect that, at the very least, this must not point to the old copy of that object. I need to think this through. Roman > Hi Erik, > > yes I have thought about this, but I am not sure if what I do is enough. > I'm basically following the logic that is implemented for hash-code: if > we encounter a stack-lock in the current thread, we can directly load > the displaced-header from it, otherwise it inflates the lock. 
I guess > that this is not really going perfect with GCs, because it means that, > e.g., concurrent marking or relocation would inflate some locks. > > The trouble might indeed be that GC threads would not be allowed to do > that because of concurrent deflation. Hrmpf. Is there a way to prevent > deflation during certain GC phases maybe? Or coordinate GC threads with > deflation? > > Robbin Ehn's work sounds promising. Can you give me more details about > what he's up to? Maybe a PR? > > Thanks, > Roman > >> Hi Roman, >> >> Might need to catch up a bit with what you have done in detail. >> However, rather than studying your changes in detail, I'm just gonna >> throw out a few things that I know would blow up when doing something >> like this, unless you have done something to fix that. >> >> The main theme is displaced mark words. Due to issues with displaced >> mark words, I don't currently know of any (good and performant) way of >> having any stable bits in the markWord that can be read by mutators, >> when there are displaced mark words. This is why in the generational >> version of ZGC we are currently building, we do not encode any age >> bits in the markWord. Instead it is encoded as a per-region/page >> property, which is more reliable, and can use much lighter >> synchronization. Anyway, here comes a few displaced mark word issues: >> >> 1) A displaced mark word can point into an ObjectMonitor. However, the >> monitor can be concurrently deflated, and subsequently freed. The safe >> memory reclamation policy for ObjectMonitor unlinks the monitors >> first, then performs a thread-local handshake with all (...?!) >> threads, and then frees them after the handshake, when it knows that >> surely nobody is looking at these monitors any longer. 
Except of >> course concurrent GC threads do not take part in such handshakes, and >> therefore, concurrent GC threads are suddenly unable to safely read >> klasses, through displaced mark words, pointing into concurrently >> freeing ObjectMonitors. It can result in use-after-free. They are >> basically not allowed to dereference ObjectMonitor, without more >> synchronization code to allow that. >> >> 2) A displaced mark word can also point into a stack lock, right into >> the stack of a concurrently running thread. Naturally, this thread can >> concurrently die, and its stack be deallocated, or concurrently >> mutated after the lock is released on that thread. In other words, the >> memory of the stack on other threads is completely unreliable. The way >> in which this works regarding hashCode, which similarly needs to be >> read by various parties, is that the stack lock is concurrently >> inflated into an inflated lock, which is then a bit more stable to >> read through, given the right sync dance. Assuming of course, that the >> reading thread, takes part in the global handshake for SMR purposes. >> >> So yeah, not sure if you have thought about any of this. If not, it >> might be the issue you are chasing after. It's worth mentioning that >> Robbin Ehn is currently removing displaced mark words with his Java >> monitors work. That should make this kind of exercise easier. >> >> Thanks, >> /Erik >> >> On 2021-07-20 13:47, Roman Kennke wrote: >>> Hi ZGC devs, >>> >>> I am struggling with a ZGC problem in Lilliput, and would like to ask >>> for your opinion. >>> >>> I'm currently working on changing runtime oopDesc::klass() to load >>> the Klass* from the object header instead of the dedicated Klass* field: >>> >>> https://github.com/openjdk/lilliput/pull/12 >>> >>> This required some coordination in other GCs, because it's not always >>> safe to access the object header. 
In particular, objects may be >>> locked, at which point we need to find the displaced header, or worst >>> case, inflate the header. I believe I've solved that in all GCs. >>> >>> However, I am still getting a failure with ZGC, which is kinda >>> unexpected, because it's the only GC that is *not* messing with >>> object headers (as far as I know). If you check out the above PR, the >>> failure can easily be reproduced with: >>> >>> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java >>> >>> (and only that test is failing for me). >>> >>> The crash is in ZHeap::is_object_live() because the ZPage there turns >>> out to be NULL. I've added a bunch of debug output in that location, >>> and it looks like the offending object is always inflated *and* >>> forwarded when it happens, but I fail to see how these are related to >>> each other, and to the page being NULL. I strongly suspect that >>> inflation of the object header by calling klass() on it causes the >>> trouble. Changing back to the original implementation of >>> oopDesc::klass() (swap commented-out-code there) makes the bug >>> disappear. >>> >>> Also, the bug always seems to happen when calling through a weak >>> barrier. Not sure if that is relevant. >>> >>> Any ideas? Opinions? >>> >>> Thanks, >>> Roman >>> >> From erik.osterlund at oracle.com Tue Jul 20 21:23:08 2021 From: erik.osterlund at oracle.com (Erik Osterlund) Date: Tue, 20 Jul 2021 21:23:08 +0000 Subject: [External] : Re: Need help with ZGC failure in Lilliput In-Reply-To: References: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> Message-ID: Hi Roman, > Alright, I disabled monitor deflation, and the problem persists. I see. Naturally, there may be multiple problems. > I narrowed the problem down to two (closely related) calls to > ZUtils::object_size() (which in turn calls oopDesc::size() which calls > oopDesc::klass()), which seem to trigger it: ZLiveMap::iterate() and > ZRelocateClosure::relocate_object().
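The reason ZUtils::object_size() ends up in oopDesc::klass() at all is that an object's size is derived from its class's "layout helper" (see klass.hpp): a positive helper is a fixed instance size, a negative one describes an array whose size is header plus length shifted by the element size. A reduced sketch, with the encoding simplified for illustration:

```java
// Simplified model of how oopDesc::size() uses the Klass layout helper.
// Real HotSpot packs header size and element shift into one negative
// int; here they are passed separately to keep the example readable.
final class ObjectSize {
    static long instanceSize(int layoutHelperBytes) {
        assert layoutHelperBytes > 0; // positive helper: fixed-size instance
        return layoutHelperBytes;
    }

    static long arraySize(int headerBytes, int log2ElemSize, int length) {
        // e.g. an int[] with a 16-byte header: 16 + (length << 2)
        return headerBytes + ((long) length << log2ElemSize);
    }
}
```

So every size query during ZGC's live-map iteration and relocation needs a klass lookup, and with Lilliput that lookup can now inflate a lock as a side effect, which is exactly the interaction being debugged here.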
When I change those two calls to use the > Klass* from the dedicated field, the failure disappears. I don't think that the > actual Klass* disagrees (I have asserts to check that), which means it can only be > the side effect of that call -- most likely the header inflation. How this inflation > would affect ZGC is still not entirely clear to me. Okay. > The problem always happens during weak storage scan. Inflation of a lock > creates a new weak handle that points back to the object. I suspect that, at the > very least, this must not point to the old copy of that object. I need to think this > through. Is the lock inflation provoked by the GC thread itself, as a side effect of reading the klass? /Erik > Roman > > > Hi Erik, > > > > yes I have thought about this, but I am not sure if what I do is enough. > > I'm basically following the logic that is implemented for hash-code: > > if we encounter a stack-lock in the current thread, we can directly > > load the displaced-header from it, otherwise it inflates the lock. I > > guess that this is not really going to work well with GCs, because it means > > that, e.g., concurrent marking or relocation would inflate some locks. > > > > The trouble might indeed be that GC threads would not be allowed to do > > that because of concurrent deflation. Hrmpf. Is there a way to prevent > > deflation during certain GC phases maybe? Or coordinate GC threads > > with deflation? > > > > Robbin Ehn's work sounds promising. Can you give me more details about > > what he's up to? Maybe a PR? > > > > Thanks, > > Roman > > > >> Hi Roman, > >> > >> Might need to catch up a bit with what you have done in detail. > >> However, rather than studying your changes in detail, I'm just gonna > >> throw out a few things that I know would blow up when doing something > >> like this, unless you have done something to fix that. > >> > >> The main theme is displaced mark words.
Due to issues with displaced > >> mark words, I don't currently know of any (good and performant) way > >> of having any stable bits in the markWord that can be read by > >> mutators, when there are displaced mark words. This is why in the > >> generational version of ZGC we are currently building, we do not > >> encode any age bits in the markWord. Instead it is encoded as a > >> per-region/page property, which is more reliable, and can use much > >> lighter synchronization. Anyway, here come a few displaced mark word > issues: > >> > >> 1) A displaced mark word can point into an ObjectMonitor. However, > >> the monitor can be concurrently deflated, and subsequently freed. The > >> safe memory reclamation policy for ObjectMonitor unlinks the monitors > >> first, then performs a thread-local handshake with all (...?!) > >> threads, and then frees them after the handshake, when it knows that > >> surely nobody is looking at these monitors any longer. Except of > >> course concurrent GC threads do not take part in such handshakes, and > >> therefore, concurrent GC threads are suddenly unable to safely read > >> klasses through displaced mark words that point into concurrently > >> freed ObjectMonitors. It can result in use-after-free. They are > >> basically not allowed to dereference an ObjectMonitor without more > >> synchronization code to allow that. > >> > >> 2) A displaced mark word can also point into a stack lock, right into > >> the stack of a concurrently running thread. Naturally, this thread > >> can concurrently die, and its stack be deallocated, or concurrently > >> mutated after the lock is released on that thread. In other words, > >> the memory of the stack on other threads is completely unreliable.
> >> The way in which this works regarding hashCode, which similarly needs > >> to be read by various parties, is that the stack lock is concurrently > >> inflated into an inflated lock, which is then a bit more stable to > >> read through, given the right sync dance. Assuming, of course, that > >> the reading thread takes part in the global handshake for SMR purposes. > >> > >> So yeah, not sure if you have thought about any of this. If not, it > >> might be the issue you are chasing after. It's worth mentioning that > >> Robbin Ehn is currently removing displaced mark words with his Java > >> monitors work. That should make this kind of exercise easier. > >> > >> Thanks, > >> /Erik > >> > >> On 2021-07-20 13:47, Roman Kennke wrote: > >>> Hi ZGC devs, > >>> > >>> I am struggling with a ZGC problem in Lilliput, and would like to > >>> ask for your opinion. > >>> > >>> I'm currently working on changing runtime oopDesc::klass() to load > >>> the Klass* from the object header instead of the dedicated Klass* field: > >>> > >>> https://github.com/openjdk/lilliput/pull/12 > >>> > >>> This required some coordination in other GCs, because it's not > >>> always safe to access the object header. In particular, objects may > >>> be locked, at which point we need to find the displaced header, or > >>> worst case, inflate the header. I believe I've solved that in all GCs. > >>> > >>> However, I am still getting a failure with ZGC, which is kinda > >>> unexpected, because it's the only GC that is *not* messing with > >>> object headers (as far as I know). If you check out the above PR, the > >>> failure can easily be reproduced with: > >>> > >>> make run-test TEST=gc/z/TestGarbageCollectorMXBean.java > >>> > >>> (and only that test is failing for me).
> >>> The crash is in ZHeap::is_object_live() because the ZPage there > >>> turns out to be NULL. I've added a bunch of debug output in that > >>> location, and it looks like the offending object is always inflated > >>> *and* forwarded when it happens, but I fail to see how these are > >>> related to each other, and to the page being NULL. I strongly > >>> suspect that inflation of the object header by calling klass() on it > >>> causes the trouble. Changing back to the original implementation of > >>> oopDesc::klass() (swap commented-out-code there) makes the bug > >>> disappear. > >>> > >>> Also, the bug always seems to happen when calling through a weak > >>> barrier. Not sure if that is relevant. > >>> > >>> Any ideas? Opinions? > >>> > >>> Thanks, > >>> Roman > >>> > >> From rkennke at redhat.com Wed Jul 21 09:12:25 2021 From: rkennke at redhat.com (Roman Kennke) Date: Wed, 21 Jul 2021 11:12:25 +0200 Subject: [External] : Re: Need help with ZGC failure in Lilliput In-Reply-To: References: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> Message-ID: <342aa6e6-076a-b536-58e5-f97e6ddf98e3@redhat.com> >> The problem always happens during weak storage scan. Inflation of a lock >> creates a new weak handle that points back to the object. I suspect that, at the >> very least, this must not point to the old copy of that object. I need to think this >> through. > > Is the lock inflation provoked by the GC thread itself, as a side effect of reading the klass? Yes. The more I think about it, the more I come to the conclusion that this might be a no-go, except maybe for a dirty prototype (which is my current goal, tbh). Can you share more about Robbin's work? Because this sounds like it would help avoid such problems.
Thanks, Roman From erik.osterlund at oracle.com Wed Jul 28 19:10:27 2021 From: erik.osterlund at oracle.com (Erik Osterlund) Date: Wed, 28 Jul 2021 19:10:27 +0000 Subject: [External] : Re: Need help with ZGC failure in Lilliput In-Reply-To: <342aa6e6-076a-b536-58e5-f97e6ddf98e3@redhat.com> References: <1717fcdc-1eb4-97ac-6361-d44a6bf27457@oracle.com> , <342aa6e6-076a-b536-58e5-f97e6ddf98e3@redhat.com> Message-ID: <5A75E206-D571-4012-AF0B-14BBB16F9304@oracle.com> Hi Roman, The idea last time I checked was to have a Java hash table to map objects to their inflated monitors, and just a few bits for locking to manage the simplest non-reentrant locking and the inflation process. And throw away stack locks and the C++ monitors we have today. I think at the moment it's just a big fat lock though in the prototype, just to get the semantics right of essentially changing monitorenter/exit to static calls to Java code. /Erik > On 21 Jul 2021, at 11:12, Roman Kennke wrote: > > >> >>> The problem always happens during weak storage scan. Inflation of a lock >>> creates a new weak handle that points back to the object. I suspect that, at the >>> very least, this must not point to the old copy of that object. I need to think this >>> through. >> Is the lock inflation provoked by the GC thread itself, as a side effect of reading the klass? > > Yes. The more I think about it, the more I come to the conclusion that this might be a no-go, except maybe for a dirty prototype (which is my current goal, tbh). > > Can you share more about Robbin's work? Because this sounds like it would help avoid such problems. > > Thanks, > Roman >
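The hash-table direction Erik sketches can be illustrated roughly as below. All names here are invented; a real implementation would key the table by object identity, hold the object weakly so the entry does not keep it alive, and coordinate with GC, none of which this toy does:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Loose sketch of a Java-side monitor table: inflated monitors live in a
// table keyed by the object, leaving only a few bits in the header for
// the locking fast path -- no displaced mark words at all, so GC threads
// never have to chase pointers out of object headers.
final class MonitorTable {
    private static final Map<Object, ReentrantLock> TABLE = new ConcurrentHashMap<>();

    static ReentrantLock monitorFor(Object o) {
        // "Inflation" becomes a table insert; nothing is displaced into
        // the header, which sidesteps the races discussed in this thread.
        return TABLE.computeIfAbsent(o, k -> new ReentrantLock());
    }
}
```

With monitors reachable only through a table, the klass bits in the header stay stable for readers such as concurrent GC threads, which is why this work would make the Lilliput exercise easier.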