SMP JNI issue, UseMembar workaround resolves it

Sun Jun 12 18:07:07 PDT 2011

> I understand... it is actually crashing during runtime, not in a debug
> session. I am just looking at the core dump. But yeah, it just prints out
> "segmentation violation" and that's it. No hs_err file or anything, no
> message from the VM.
>
> I will have a look at what the native code is doing in terms of installing
> signal handlers and let you know what I find out.
>
>> Correction ...
>>
>> David Holmes said the following on 06/11/11 17:04:
>>> I'll try and take a deeper look at this but note that if a safepoint is
>>> pending the thread is supposed to "crash" in
>>> write_memory_serialize_page. The SEGV so generated should be handled by
>>> the VM and take the thread to the safepoint.
>>
>> The SEGV handler doesn't take the thread to the safepoint; it just delays
>> the thread until the serialization page is unprotected. The subsequent
>> state transition logic will take the thread to the safepoint if needed.
>>
>> The thing is to see exactly what is reported when the real crash
>> occurs. If we get a simple OS-level abort message then the VM signal
>> handler did not get invoked which may indicate that native code has
>> changed the installed signal handlers.
>>
>> David
>>
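
For anyone following along, here is a rough sketch of the "expected SEGV" idea David describes above: a store to a protected page faults, a handler recognizes the address as expected, and the thread then carries on. This is not HotSpot code. The names guard_page, on_segv and demo_expected_fault are made up for illustration, it assumes Linux, and as a simplification the handler unprotects the page itself, whereas per David's description the VM's handler only delays the thread until the page is unprotected elsewhere.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for the serialization page (not the VM's own). */
static char *guard_page;
static size_t guard_size;
static volatile sig_atomic_t expected_faults;

static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)guard_page;
    if (addr >= base && addr < base + guard_size) {
        /* Expected fault: make the page writable and return, so the
         * faulting store is retried. mprotect is not on the POSIX
         * async-signal-safe list; this relies on Linux behavior. */
        expected_faults++;
        mprotect(guard_page, guard_size, PROT_READ | PROT_WRITE);
        return;
    }
    /* Unexpected fault: restore the default action and return, so the
     * retried access terminates the process in the usual way. */
    signal(sig, SIG_DFL);
}

/* Returns the number of expected faults handled (1 on success), or -1. */
int demo_expected_fault(void) {
    struct sigaction sa, old;

    guard_size = (size_t)sysconf(_SC_PAGESIZE);
    guard_page = mmap(NULL, guard_size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (guard_page == MAP_FAILED)
        return -1;
    expected_faults = 0;

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = on_segv;
    if (sigaction(SIGSEGV, &sa, &old) != 0) {
        munmap(guard_page, guard_size);
        return -1;
    }

    *(volatile char *)guard_page = 1;  /* faults once, then succeeds */

    sigaction(SIGSEGV, &old, NULL);    /* put the previous handler back */
    munmap(guard_page, guard_size);
    return (int)expected_faults;
}
```

If something replaces on_segv with its own handler (or resets it to the default) before the store, the same store kills the process with a plain OS-level message, which matches the symptom discussed in this thread.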

After gathering further background based on the information you provided,
I found that the JNI code "attempts" to install signal handlers for
cleanup purposes, but punts the operation if it finds that libjsig is not
LD_PRELOADed (actually, it logs a warning in the info log - duh - but
nothing in the error log, so I missed it). In any case, I tried preloading
libjsig, which eliminated the warning message, but that runtime still
crashed after about an hour without the -XX:+UseMembar option. Essentially
the behavior is the same. I have had two clients running for about 72 hours
without issue using the memory barrier, so we are happy to at least have a
workaround for now.

If you have further suggestions for diagnosing the root cause of the
error, we can spend some minor effort looking into the issue further, but
as we are running stable at this point, we consider the problem resolved
internally.

I appreciate the quick response, and again, let me know if there is
further useful information I can provide.

-Scott V.

>>
>>> It seems the signal is not
>>> being handled correctly. UseMembar will work around this by not using
>>> the serialization page.
>>>
>>> If you observed this SEGV under gdb then it may be a red herring, as gdb
>>> is stopping the VM from handling the SEGV when it is actually an
>>> expected signal.
>>>
>>> When the real crash occurs what exactly gets reported?
>>>
>>> David Holmes
>>>
>>> Scott Valentine said the following on 06/11/11 15:58:
>>>> We ran into an issue where our application would consistently crash with
>>>> a segmentation violation after roughly 15 to 90 minutes of runtime. It's
>>>> not exactly a bug, but I thought it would be helpful to post the
>>>> information here for other folks, and to hopefully support the great work
>>>> of OpenJDK developers down the road.
>>>>
>>>> The quick details are that we consistently die without much error detail
>>>> (just a simple segmentation violation printout) when our code enters JNI,
>>>> does some stuff, and then calls back into the VM. The JNI_ENTRY fails
>>>> when calling transition_from_native.
>>>>
>>>> The client application is running on an Acer Aspire One netbook (Atom
>>>> N270, dual core @800MHz) with OpenJDK-1.6.0-20-1.9.7. A gdb stack trace
>>>> and jstack dump are attached for details on what is happening. More
>>>> details on the system structure are included below for those interested,
>>>> but basically it is a moderately threaded, JNI-intensive application
>>>> running under the Equinox OSGi runtime.
>>>>
>>>> It was a little tough to debug, as the clients are remote and I have to
>>>> go through multiple ssh back-doors. We initially suspected our JNI
>>>> middleware, but after getting the necessary debugging symbols, tools, and
>>>> builds in place, we found that it was always crashing on the
>>>> write_memory_serialize_page call when attempting JNI_ENTRY after spending
>>>> some time in the native code. It never even got to the point of
>>>> referencing values like the VM env, jobject, etc. Anyhow, the source for
>>>> the transition_from_native call led us to try the -XX:+UseMembar option,
>>>> which seems to have resolved the issue.
>>>>
>>>> Anyhow, I hope the trace info is helpful, and please let me know if I can
>>>> provide more info. I can't spare a ton of cycles, but I would be happy to
>>>> contribute as time permits.
>>>>
>>>> Here are the application details:
>>>>
>>>> As mentioned previously, the application is running in the Equinox OSGi
>>>> framework, and it relies heavily on two JNI libraries: the RXTX library
>>>> (2.1-7r2), and a middleware called opensplice DDS (5.4.1). Opensplice is
>>>> a shared memory model runtime that runs as three separate processes, and
>>>> has a JNI interface into the framework. The application has two serial
>>>> devices (two RXTX threads), and we have a thread for each (two more
>>>> threads) that does blocking reads on those ports. These threads put data
>>>> into a BlockingQueue, which has another thread that takes data from the
>>>> queue and processes it (two more threads). These threads process the
>>>> data, make JNI calls into the DDS middleware (this is where the failures
>>>> have, at least so far, always occurred), and put some information into
>>>> another BlockingQueue. There are two other application threads (total of
>>>> eight now). The first periodically writes to one of the serial ports. The
>>>> other thread handles the second BlockingQueue and also makes JNI calls
>>>> into the DDS middleware. Overall, there are three threads calling into
>>>> that middleware independently.
>>>>
>>>> I think there are something like 20 threads total, but three are the JVM
>>>> threads, and 7 or so are related to Equinox and our launcher that don't
>>>> really do anything unless the system is starting or stopping or doing
>>>> something in the OSGi world.
>>>>
>>>> Thanks, and again, I hope this info can be helpful to others.
>>>>
>>>> Scott Valentine
>>>>
>>>> Concentris Systems LLC
>>>> Manoa Innovation Center, Suite #238
>>>> 2800 Woodlawn Drive
>>>> Honolulu, HI  96822
>>>>
>>>> http://www.Concentris-Systems.com
>>>>
>>>> (808) 988-6100
>>
