SMP JNI issue, UseMembar workaround resolves it

Scott Valentine svalentine at concentris-systems.com
Sat Jun 11 14:00:59 PDT 2011


I understand... it is actually crashing during runtime, not in a debug
session. I am just looking at the core dump. But yeah, it just prints out
"segmentation violation" and that's it. No hs_err file or whatever, and no
message from the VM.

I will have a look at what the native code is doing in terms of installing
signal handlers and let you know what I find out.
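
For anyone checking the same thing, a minimal way to inspect the
currently installed SIGSEGV disposition from inside the native code is a
sigaction() query. A sketch, with an illustrative helper name of my own:

    #include <signal.h>
    #include <stdio.h>

    /* Illustrative helper (the name is made up): call it before and
     * after the middleware initializes; if the reported handler address
     * changes, a native library has replaced the VM's SIGSEGV handler.
     * Passing NULL as the new action makes sigaction() a pure query. */
    static void dump_segv_handler(const char *where)
    {
        struct sigaction sa;

        if (sigaction(SIGSEGV, NULL, &sa) == 0) {
            void *h = (sa.sa_flags & SA_SIGINFO)
                          ? (void *)sa.sa_sigaction
                          : (void *)sa.sa_handler;
            fprintf(stderr, "%s: SIGSEGV handler = %p\n", where, h);
        }
    }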

> Correction ...
>
> David Holmes said the following on 06/11/11 17:04:
>> I'll try to take a deeper look at this, but note that if a safepoint
>> is pending the thread is supposed to "crash" in
>> write_memory_serialize_page. The SEGV so generated should be handled
>> by the VM and take the thread to the safepoint.
>
> The SEGV handler doesn't take the thread to the safepoint; it just
> delays the thread until the serialization page is unprotected. The
> subsequent state transition logic will take the thread to the
> safepoint if needed.
>
> The thing is to see exactly what is reported when the real crash
> occurs. If we get a simple OS-level abort message then the VM signal
> handler did not get invoked, which may indicate that native code has
> changed the installed signal handlers.
>
> David
>
>
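
If the handlers do turn out to be overwritten, HotSpot ships a
signal-chaining shim, libjsig, for exactly this case: preloaded, it
makes later signal()/sigaction() calls from native code chain to the
VM's handlers instead of replacing them. Assuming a 32-bit JDK 6
directory layout, the invocation would look something like:

    LD_PRELOAD=$JAVA_HOME/jre/lib/i386/libjsig.so java ...
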
>> It seems the signal is not being handled correctly. UseMembar will
>> work around this by not using the serialization page.
>>
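
For readers unfamiliar with the mechanism being discussed: rather than
issuing a memory barrier on every thread state transition, the VM has
each transitioning thread write to a shared serialization page, and to
serialize with all threads it write-protects that page, so in-flight
transitions fault and are held in the SEGV handler until the page is
unprotected again. A rough sketch of the technique (illustrative only,
not HotSpot's actual code):

    #include <signal.h>
    #include <sys/mman.h>

    #define PAGE_SZ 4096

    static volatile int page_protected;
    static char *serialize_page;   /* one mmap'ed page shared by all threads */

    /* Every native->VM transition stores to the page; if the VM has
     * just write-protected it, the store faults and the thread is
     * caught in the handler below. */
    static void touch_serialize_page(int thread_slot)
    {
        serialize_page[(thread_slot * 64) % PAGE_SZ] = 1;
    }

    /* SIGSEGV handler, installed with SA_SIGINFO. Faults on the
     * serialization page are stalled until the page is writable again;
     * returning retries the faulting store. Anything else is a real
     * crash and gets re-raised with the default disposition. */
    static void segv_handler(int sig, siginfo_t *info, void *ctx)
    {
        char *addr = (char *)info->si_addr;

        if (addr >= serialize_page && addr < serialize_page + PAGE_SZ) {
            while (page_protected)
                ;                      /* wait for serialize_memory() */
            return;
        }
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
    }

    /* The VM side: protect the page so in-transition threads fault, do
     * the serialization work, then unprotect to release them. */
    static void serialize_memory(void)
    {
        page_protected = 1;
        mprotect(serialize_page, PAGE_SZ, PROT_READ);
        /* ... observe thread states here ... */
        page_protected = 0;
        mprotect(serialize_page, PAGE_SZ, PROT_READ | PROT_WRITE);
    }
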
>> If you observed this SEGV under gdb then it may be a red herring, as
>> gdb is stopping the VM from handling the SEGV when it is actually an
>> expected signal.
>>
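
A side note for anyone reproducing this under a debugger: since HotSpot
uses SIGSEGV internally, gdb is usually told to pass the signal straight
through rather than stopping on it, so that only a genuinely fatal one
surfaces:

    (gdb) handle SIGSEGV nostop noprint pass
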
>> When the real crash occurs what exactly gets reported?
>>
>> David Holmes
>>
>> Scott Valentine said the following on 06/11/11 15:58:
>>> We ran into an issue where our application would consistently crash
>>> with a segmentation violation after roughly 15 to 90 minutes of
>>> runtime. It's not exactly a bug, but I thought it would be helpful
>>> to post the information here for other folks, and to hopefully
>>> support the great work of OpenJDK developers down the road.
>>>
>>> The quick details are that we consistently die without much error
>>> detail (just a simple segmentation violation printout) when our code
>>> enters JNI, does some stuff, and then calls back into the VM. The
>>> JNI_ENTRY fails when calling transition_from_native.
>>>
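
For context on where that transition happens: a JNI native method runs
with the thread in the "in native" state, and every call back through
the JNIEnv re-enters the VM via JNI_ENTRY, which performs
transition_from_native (including the serialization page write). A
minimal sketch of that call pattern, with purely hypothetical class and
method names:

    #include <jni.h>

    /* Hypothetical native method (the names are made up). The long
     * native work happens outside the VM; each call back through env
     * (GetObjectClass, GetMethodID, CallVoidMethod) performs the
     * native->VM transition where the crash was observed. */
    JNIEXPORT void JNICALL
    Java_com_example_Bridge_process(JNIEnv *env, jobject self, jobject sink)
    {
        /* ... potentially long stretch of pure native work ... */

        jclass cls = (*env)->GetObjectClass(env, sink);      /* transition */
        jmethodID mid = (*env)->GetMethodID(env, cls, "onData", "()V");

        if (mid != NULL)
            (*env)->CallVoidMethod(env, sink, mid);          /* transition */
    }
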
>>> The client application is running on an Acer Aspire One netbook
>>> (Atom N270, dual core @ 800MHz) with OpenJDK-1.6.0-20-1.9.7. A gdb
>>> stack trace and jstack dump are attached for details on what is
>>> happening. More details on the system structure are included below
>>> for those interested, but basically it is a moderately threaded,
>>> JNI-intensive application running under the Equinox OSGi runtime.
>>>
>>> It was a little tough to debug, as the clients are remote and I have
>>> to go through multiple ssh back-doors. We initially suspected our
>>> JNI middleware, but after getting the necessary debugging symbols,
>>> tools, and builds in place, we found that it was always crashing on
>>> the write_memory_serialize_page call when attempting JNI_ENTRY after
>>> spending some time in the native code. It never even got to the
>>> point of referencing values like the VM env, jobject, etc. Anyhow,
>>> the source for the transition_from_native call led us to try the
>>> -XX:+UseMembar option, which seems to have resolved the issue.
>>>
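
For anyone wanting to try the same workaround: UseMembar is a product VM
option, so it goes directly on the command line (the jar name here is
hypothetical):

    java -XX:+UseMembar -jar launcher.jar

It replaces the serialization-page trick with real memory barriers on
thread state transitions, which is why it sidesteps the failing SEGV
path entirely.
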
>>> I hope the trace info is helpful, and please let me know if I can
>>> provide more info. I can't spare a ton of cycles, but I would be
>>> happy to contribute as time permits.
>>>
>>> Here are the application details:
>>>
>>> As mentioned previously, the application is running in the Equinox
>>> OSGi framework, and it relies heavily on two JNI libraries: the RXTX
>>> library (2.1-7r2) and a middleware called OpenSplice DDS (5.4.1).
>>> OpenSplice is a shared-memory-model runtime that runs as three
>>> separate processes and has a JNI interface into the framework. The
>>> application has two serial devices (two RXTX threads), and we have a
>>> thread for each (two more threads) that does blocking reads on those
>>> ports. These threads put data into a BlockingQueue, and another
>>> thread takes data from each queue and processes it (two more
>>> threads). These threads process the data, make JNI calls into the
>>> DDS middleware (this is where the failures have, at least so far,
>>> always occurred), and put some information into another
>>> BlockingQueue. There are two other application threads (a total of
>>> eight now). The first periodically writes to one of the serial
>>> ports. The other handles the second BlockingQueue and also makes JNI
>>> calls into the DDS middleware. Overall, there are three threads
>>> calling into that middleware independently.
>>>
>>> I think there are something like 20 threads total, but three are the
>>> JVM threads, and seven or so are related to Equinox and our launcher
>>> and don't really do anything unless the system is starting,
>>> stopping, or doing something in the OSGi world.
>>>
>>> Thanks, and again, I hope this info can be helpful to others.
>>>
>>> Scott Valentine
>


Scott Valentine

Concentris Systems LLC
Manoa Innovation Center, Suite #238
2800 Woodlawn Drive
Honolulu, HI  96822

http://www.Concentris-Systems.com

(808) 988-6100
