SIGSEGV of JDK21+ when using jweak references (possible regression)
Leslie Zhai
zhaixiang at loongson.cn
Tue Sep 26 13:25:35 UTC 2023
Hi Norman,
Strange is why jdk17[1] is not able to reproduce the issue?
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Netty/Incubator/Codec/Parent/Quic 0.0.51.Final-SNAPSHOT:
[INFO] [INFO] Netty/Incubator/Codec/Parent/Quic .................. SUCCESS [ 0.759 s]
[INFO] Netty/Incubator/Codec/Classes/Quic ................. SUCCESS [ 1.878 s]
[INFO] Netty/Incubator/Codec/Native/Quic .................. SUCCESS [06:26 min]
[INFO] Netty/Testsuite/NativeImage ........................ SUCCESS [ 0.055 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 06:29 min
[INFO] Finished at: 2023-09-26T21:19:30+08:00
[INFO] ------------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hs_err_pid169206.log.tar.xz
Type: application/x-xz
Size: 19320 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20230926/db0cd505/hs_err_pid169206.log.tar-0001.xz>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: hs_err_pid183922.log.tar.xz
Type: application/x-xz
Size: 14340 bytes
Desc: not available
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20230926/db0cd505/hs_err_pid183922.log.tar-0001.xz>
-------------- next part --------------
hs_err_pid169206.log is fastdebug, and hs_err_pid183922.log is release for jdk[2].
1. https://github.com/openjdk/jdk17u-dev/commit/1d618a30ee57e5bfd4b98107a239b54de3d057bf
2. https://github.com/openjdk/jdk/commit/e510dee162612d9a706ba54d0ab79a49139e77d8
Thanks,
Leslie Zhai
> 2023?9?26? 21:16?Norman Maurer <norman.maurer at googlemail.com> ???
>
> Ha, yes that fixes it..Seems like I missed deleting this line when I introduced the jweak stuff. Thanks a lot for the help. I am really happy that this turns out to be my bug :)
>
> I guess I was just lucky enough with earlier releases that there were no asserts in place yet.
>
> Bye
> Norman
>
>
>> On 26. Sep 2023, at 14:53, Stefan Karlsson <stefan.karlsson at oracle.com> wrote:
>>
>> Hi Norman,
>>
>> Thanks for the detailed description of the problem!
>>
>> I didn't think that this would be a bug in the JVM so I debugged this with a debug build of the JDK and found that the crash happens because the code tries to delete a global handle with the local handles APIs:
>>
>> # Internal Error (/home/stefank/git/jdk/open/src/hotspot/share/runtime/jniHandles.inline.hpp:52), pid=1340857, tid=1340858
>> # assert(is_local_tagged(handle)) failed: precondition
>>
>> I found the offending line by adding a watch point for the address of the handle and reverse-finish:ing with rr. This showed me that the global handle was created here:
>> NETTY_JNI_UTIL_LOAD_CLASS(env, sslTaskClass, name, done);
>>
>> I think this fixes the bug:
>> diff --git a/codec-native-quic/src/main/c/netty_quic_boringssl.c b/codec-native-quic/src/main/c/netty_quic_boringssl.c
>> index c3f9a8e..cae8238 100644
>> --- a/codec-native-quic/src/main/c/netty_quic_boringssl.c
>> +++ b/codec-native-quic/src/main/c/netty_quic_boringssl.c
>> @@ -1394,7 +1394,6 @@ jint netty_boringssl_JNI_OnLoad(JNIEnv* env, const char* packagePrefix) {
>> NETTY_JNI_UTIL_LOAD_CLASS_WEAK(env, sslTaskClassWeak, name, done);
>> NETTY_JNI_UTIL_NEW_LOCAL_FROM_WEAK(env, sslTaskClass, sslTaskClassWeak, done);
>>
>> - NETTY_JNI_UTIL_LOAD_CLASS(env, sslTaskClass, name, done);
>> NETTY_JNI_UTIL_GET_FIELD(env, sslTaskClass, sslTaskReturnValue, "returnValue", "I", done);
>> NETTY_JNI_UTIL_GET_FIELD(env, sslTaskClass, sslTaskComplete, "complete", "Z", done);
>> NETTY_JNI_UTIL_GET_METHOD(env, sslTaskClass, sslTaskDestroyMethod, "destroy", "()V", done);
>>
>> With this patch all the tests passes.
>>
>> HTH,
>> StefanK
>>
>> On 2023-09-26 13:06, Norman Maurer wrote:
>>> Hi all,
>>>
>>> When trying to upgrade our build CI to use JDK21 for netty, we noticed that we were seeing SIGSEGV during GC operations. After days of debugging and looking around I am quite sure this is actually a JDK bug / regression introduced in JDK21.
>>> The relevant issue that introduced this bug / regression is https://bugs.openjdk.org/browse/JDK-8299089 and so the commit https://github.com/openjdk/jdk21u/commit/c7056737e33d3d5a6ec24639d46b9e3e7a8da01a.
>>>
>>> I verified this by checkout c7056737e33d3d5a6ec24639d46b9e3e7a8da01a and build the JDK myself. This still produced the SIGSEGV when running our testsuite with the built JDK image. When checkout the previous commit 66f7387b5ffa53861b92b068fb9832fc433d9f79, rebuilding the JDK and running the testsuite of netty everything works as expected and there is no SIGSEGV.
>>>
>>> The commit that will trigger the problem was introduced in netty by https://github.com/netty/netty-incubator-codec-quic/commit/487fcb24414e9d5da776d218c4ed48beface43e9 . This basically introduced the usage of jweak in our JNI code.
>>>
>>> Reproducing the SIGSEGV is really easy. Just checkout our current main branch and try to build the project while using JDK21:
>>>
>>> # git clone https://github.com/netty/netty-incubator-codec-quic.git
>>> # cd netty-incubator-codec-quic
>>> # ./mvnw clean package
>>>
>>>
>>> What is really interesting is that this segfault during GC will happen with not only G1 but also other GC implementations. I only tested G1 and ZGC for now, but I suspect others are affected as well.
>>>
>>> Here is the relevant hs_err content when using G1:
>>>
>>>
>>> --------------- T H R E A D ---------------
>>>
>>> Current thread (0x0000000129689910): WorkerThread "GC Thread#6" [id=32771, stack(0x000000016eaa4000,0x000000016eca7000) (2060K)]
>>>
>>> Stack: [0x000000016eaa4000,0x000000016eca7000], sp=0x000000016eca6bc0, free space=2058k
>>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
>>> V [libjvm.dylib+0x43a73c] void G1ParCopyClosure<(G1Barrier)0, false>::do_oop_work<oopDesc*>(oopDesc**)+0x40
>>> V [libjvm.dylib+0x43cdfc] G1RootProcessor::process_vm_roots(G1RootClosures*, G1GCPhaseTimes*, unsigned int)+0x3f0
>>> V [libjvm.dylib+0x43c89c] G1RootProcessor::evacuate_roots(G1ParScanThreadState*, unsigned int)+0x7c
>>> V [libjvm.dylib+0x444a84] G1EvacuateRegionsTask::scan_roots(G1ParScanThreadState*, unsigned int)+0x24
>>> V [libjvm.dylib+0x444970] G1EvacuateRegionsBaseTask::work(unsigned int)+0x80
>>> V [libjvm.dylib+0xae482c] WorkerThread::run()+0x94
>>> V [libjvm.dylib+0xa28ef8] Thread::call_run()+0xc8
>>> V [libjvm.dylib+0x865220] thread_native_entry(Thread*)+0x158
>>> C [libsystem_pthread.dylib+0x6fa8] _pthread_start+0x94
>>>
>>> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x000000000000faf8
>>>
>>>
>>> And here for ZGC
>>>
>>>
>>> --------------- T H R E A D ---------------
>>>
>>> Current thread (0x000000011af04080): WorkerThread "XWorker#2" [id=13571, stack(0x000000016c054000,0x000000016c257000) (2060K)]
>>>
>>> Stack: [0x000000016c054000,0x000000016c257000], sp=0x000000016c250d00, free space=2035k
>>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
>>> V [libjvm.dylib+0xafde10] XMark::mark_and_follow(XMarkContext*, XMarkStackEntry)+0x160
>>> V [libjvm.dylib+0xafe8c4] XMark::work_without_timeout(XMarkContext*)+0xe0
>>> V [libjvm.dylib+0xafee24] XMark::work(unsigned long long)+0xd0
>>> V [libjvm.dylib+0xb1c6c4] XTask::Task::work(unsigned int)+0x28
>>> V [libjvm.dylib+0xae482c] WorkerThread::run()+0x94
>>> V [libjvm.dylib+0xa28ef8] Thread::call_run()+0xc8
>>> V [libjvm.dylib+0x865220] thread_native_entry(Thread*)+0x158
>>> C [libsystem_pthread.dylib+0x6fa8] _pthread_start+0x94
>>>
>>> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 0x0000000000000100
>>>
>>>
>>> Let me know if there is anything else you need or if this is the wrong mailinglist. There is also still the possibility that its caused by a bug in our code but at the moment it feels more like a JDK regression / bug.
>>>
>>> Thanks
>>> Norman
>>>
>>
More information about the hotspot-gc-dev
mailing list