Jetty hangs early on Java 12 on OSv with Threadlocal Handshakes ON - shorter version

Fri Aug 9 13:05:35 UTC 2019

Hi,

On 2019-08-09 14:23, Waldek Kozaczuk wrote:
> Hi,
> 
> Thanks for your suggestions.
> 
> I have tried against  openjdk-jdk12-latest-linux-x86_64-release.tar.xz from 
> https://builds.shipilev.net/openjdk-jdk12/ and openjdk-jdk13-latest-linux-x86_64-release.tar.xz 
> (Java 13) from https://builds.shipilev.net/openjdk-jdk13/  and I am seeing same 
> symptoms - hanging.

Pick up a slowdebug build of jdk/jdk; I think Shipilev calls that openjdk-jdk.

> 
> I have also tried to use the fast-debug distribution 
> - openjdk-jdk12-latest-linux-x86_64-fastdebug.tar.xz - but unfortunately I hit 
> this snag per gdb:

Either just start the process with
 >java -cp . MyHang &
The pid is printed.
Wait until it hangs, then:
 >gdb -p <pid>
gdb>info threads
Note the VMThread number.
gdb>thread <VMThreadNumber>
gdb>step
...
Keep stepping until you see which thread has not completed.
If all threads have completed but the count has still not reached zero,
a thread wrongfully elided the handshake.

Alternatively, you can also do:
 >gdb --args java -cp . MyHang
gdb>handle all nostop noprint pass
gdb will now pass signals on to the JVM without stopping.
gdb>run
Wait for the hang, interrupt with Ctrl-C, then do what I describe above.

> 
> 2. It iterates through all JavaThread instances from the jtiwh collection (what 
> exactly is this?) 

We use hazard pointers for lock-free JavaThread accesses.
jtiwh is an iterator for looping over this lock-free data structure.
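
To illustrate the idea only, here is a minimal hazard-pointer sketch in C11. It is not the HotSpot implementation: thread_list, acquire_list, release_list, publish_list, is_protected and MAX_READERS are made-up names, and I assume a fixed number of reader slots. A reader publishes the list it is about to walk, re-checks that the list is still current, and a writer may only free an old list once no slot points at it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_READERS 4 /* hypothetical fixed number of reader slots */

/* Hypothetical stand-in for a snapshot of the thread list. */
typedef struct {
    int length;
    int ids[8];
} thread_list;

static _Atomic(thread_list *) current_list;        /* list writers publish */
static _Atomic(thread_list *) hazard[MAX_READERS]; /* one slot per reader  */

/* Reader side: publish the pointer we intend to use, then re-check that
 * it is still the current list. Retry if a writer swapped it meanwhile. */
thread_list *acquire_list(int reader) {
    for (;;) {
        thread_list *seen = atomic_load(&current_list);
        atomic_store(&hazard[reader], seen);
        if (atomic_load(&current_list) == seen) {
            return seen;
        }
    }
}

/* Reader side: done iterating, drop the protection. */
void release_list(int reader) {
    atomic_store(&hazard[reader], NULL);
}

/* Writer side: install a new list and hand back the old one. */
thread_list *publish_list(thread_list *fresh) {
    return atomic_exchange(&current_list, fresh);
}

/* Writer side: an old list may only be freed when this returns false. */
bool is_protected(const thread_list *old) {
    for (int i = 0; i < MAX_READERS; i++) {
        if (atomic_load(&hazard[i]) == old) {
            return true;
        }
    }
    return false;
}
```

An iterator like jtiwh would sit between acquire_list() and release_list() in this model: it walks the snapshot it acquired, and nobody frees that snapshot underneath it.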

> and SHOULD end up calling HandshakeState::process_by_vmthread
> (https://github.com/openjdk/jdk/blob/jdk-12%2B12/src/hotspot/share/runtime/handshake.cpp#L361) 
> which eventually should call HandshakeThreadsOperation::do_handshake(JavaThread* 
> thread) for each JavaThread and increment the _done semaphore BUT only if the 
> process_by_vmthread() calls _operation->do_handshake(target), right? But there 
> are at least 3 IFs which may cause this method to return and the critical 4th IF 
> (https://github.com/openjdk/jdk/blob/jdk-12%2B12/src/hotspot/share/runtime/handshake.cpp#L379-L389) 
> which only executes its body if vmthread_can_process_handshake returns true. 
> What could make these first 3 IF be true and make
> process_by_vmthread to return BEFORE the last if? When would last IF be false?

Either the JavaThread itself processes the handshake or, if it has not done so, 
the VMThread checks whether it would be safe to process it on behalf of the 
JavaThread.
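
Roughly, and only as a model, the two paths look like the C sketch below. This is not the code in handshake.cpp: handshake_state, safe_for_vm, executed_by and the function bodies are invented for illustration, and I assume POSIX unnamed semaphores (sem_init) are available. The early returns correspond to "already done", "not safe to touch" and "somebody else holds the per-thread semaphore".

```c
#include <errno.h>
#include <semaphore.h>
#include <stdbool.h>

/* Hypothetical model of one target thread's handshake state. */
typedef struct {
    sem_t claim;         /* binary semaphore serializing access to the op */
    bool  has_operation; /* true until somebody has executed the op       */
    bool  safe_for_vm;   /* e.g. the thread is in native or blocked       */
    int   executed_by;   /* 0 = nobody, 1 = thread itself, 2 = VM thread  */
} handshake_state;

int handshake_state_init(handshake_state *s, bool safe_for_vm) {
    s->has_operation = true;
    s->safe_for_vm = safe_for_vm;
    s->executed_by = 0;
    return sem_init(&s->claim, 0, 1);
}

/* Blocking claim that retries when interrupted by a signal. */
static int claim_blocking(sem_t *sem) {
    int rc;
    do {
        rc = sem_wait(sem);
    } while (rc != 0 && errno == EINTR);
    return rc;
}

/* Path 1: the target thread reaches a poll and runs the op itself. */
bool process_by_self(handshake_state *s) {
    if (claim_blocking(&s->claim) != 0) {
        return false;
    }
    bool ran = false;
    if (s->has_operation) {
        s->has_operation = false;
        s->executed_by = 1;
        ran = true;
    }
    sem_post(&s->claim);
    return ran;
}

/* Path 2: the VM thread tries to run the op on the target's behalf. */
bool process_by_vmthread(handshake_state *s) {
    if (!s->has_operation) return false;            /* already done        */
    if (!s->safe_for_vm) return false;              /* target is running   */
    if (sem_trywait(&s->claim) != 0) return false;  /* claim held elsewhere */
    bool ran = false;
    if (s->has_operation && s->safe_for_vm) {       /* re-check under claim */
        s->has_operation = false;
        s->executed_by = 2;
        ran = true;
    }
    sem_post(&s->claim);
    return ran;
}
```

In this model a false return from process_by_vmthread() is harmless as long as the target thread later reaches process_by_self(). If neither path ever runs the operation, the VM thread keeps polling forever.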

> 
> 3. VM_HandshakeAllThreads::doit() loops by calling poll_for_completed_thread() 
> which NEVER completes.

It does that just fine on Linux/Windows/OSX.
I'm guessing there is a bug in one of the OS primitives used.
Since the use of semaphores is new in hotspot, that would be my first 
candidate. There is one semaphore per thread and one operation semaphore.
For every post on the operation semaphore, number_of_threads_completed is 
increased. If the VMThread never leaves the loop, it means it did not get all 
the posts it wanted.
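
A small C model of that counting, which can also serve as a sanity check of sem_post/sem_trywait on the target OS. It is a sketch under assumptions, not the VM_HandshakeAllThreads code: drain_posts, poll_for_completion and the max_polls bound are invented here (the real loop has no such bound), and I assume POSIX unnamed semaphores.

```c
#include <errno.h>
#include <semaphore.h>

/* Consume every post currently available on the operation semaphore
 * without blocking. Returns the number consumed, or -1 if sem_trywait
 * fails with anything other than EAGAIN or EINTR. */
int drain_posts(sem_t *done) {
    int consumed = 0;
    for (;;) {
        if (sem_trywait(done) == 0) {
            consumed++;
            continue;
        }
        if (errno == EAGAIN) {
            return consumed; /* semaphore is at zero: nothing more to take */
        }
        if (errno == EINTR) {
            continue;        /* interrupted by a signal: try again */
        }
        return -1;           /* unexpected failure of the primitive */
    }
}

/* Poll until `issued` posts have been seen or `max_polls` rounds have
 * passed. Returns the number of completed threads seen, or -1 on error. */
int poll_for_completion(sem_t *done, int issued, int max_polls) {
    int completed = 0;
    for (int i = 0; i < max_polls && completed < issued; i++) {
        int n = drain_posts(done);
        if (n < 0) {
            return -1;
        }
        completed += n;
    }
    return completed;
}
```

If a post is lost, or sem_trywait reports failure although the count is positive, poll_for_completion() returns fewer than issued. In the unbounded real loop that would show up as exactly the kind of hang described.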

Instead of gdb, it may be easier to start by adding:
-XX:HandshakeTimeout=1000
(on a release build I believe you also need -XX:+UnlockDiagnosticVMOptions)

That should give some valuable output.

/Robbin

> 
> Correct me where I made any mistake in my thinking.
> 
> Thanks in advance for your support,
> Waldek
> 
> 
> 
> 
> 
> 
> On Thu, Aug 8, 2019 at 6:53 AM Robbin Ehn <robbin.ehn at oracle.com> wrote:
> 
>     Hi Waldek,
> 
> 
>      > #0  0x0000100000c86370 in sem_trywait@plt ()
>      > #1  0x0000100001702bf2 in PosixSemaphore::trywait() ()
>      > #2  0x000010000121aab4 in VM_HandshakeAllThreads::doit() ()
> 
>     VMThread is trying to figure out which threads have not executed their
>     handshake operation and possibly execute it for them (e.g. JavaThread in
>     native).
> 
>     If you attach gdb and step around in VM_HandshakeAllThreads::doit while looping
>     there, you can figure out why we never finish.
> 
>      > When I explicitly enable thread local handshakes on Java 11 it works as
>      > well:
> 
>     In 11 it is only used by ZGC, but in 12 it is used for stack-scanning after
>     nmethods.
> 
>     Sweeper thread:
> 
>     #12 0x0000100001219a48 in Handshake::execute(ThreadClosure*) ()
>     #13 0x00001000018b4a0c in NMethodSweeper::possibly_sweep() ()
> 
>      > One more datapoint - I am able to run a simple Hello World java app on Java
>      > 12 on OSv with thread-local handshakes on so it looks like this problem
>      > appears on more complicated apps.
> 
>     If you never get to the point of needing to sweep nmethods, you will never
>     have executed a handshake (but you will execute safepoints using the
>     handshake poll).
> 
>     If you can easily reproduce it, take a slowdebug build and, when the process
>     hangs, attach gdb, switch to the VMThread and step around. You will see a
>     count which decreases once for every thread which has executed its
>     handshake. To figure out if a thread has executed its handshake, the
>     VMThread will take a per-thread semaphore (decrease it), which serializes
>     access to the per-thread handshake operation, then look whether the
>     operation is completed, return the semaphore (increase it), and look at
>     the next thread.
> 
>     /Robbin
> 
>      >
>      > *Lastly, the exact same app works fine on Java 12 on Linux.*
>      >
>      > So given all that could you please point me to what the problem might be?
>      > What has changed in the implementation of thread local handshakes between
>      > Java 11 and 12 that may have caused this issue (I see many changes made to
>      > https://github.com/openjdk/jdk/blob/0fe0312f416add1536a45ecfb292c887ef7e02bd/src/hotspot/share/runtime/handshake.cpp
>      > but
>      > I am not even sure which commit matches the version of Java 12 or Java 11 I
>      > am using)? I am thinking this has to do with some subtle differences
>      > between relevant ABI implementation between OSv and Linux. Signals related?
>      >
>      > Thanks for your help in advance,
>      >
>      > Waldek
>      >
>      > Thread dump of all Java threads (in other email that was too big)
>      >
>