RFR: 8257831: Suspend with handshakes [v2]

David Holmes david.holmes at oracle.com
Wed Mar 31 13:59:49 UTC 2021


On 31/03/2021 8:51 pm, Robbin Ehn wrote:
> On Wed, 31 Mar 2021 02:43:21 GMT, David Holmes <dholmes at openjdk.org> wrote:
> 
>>> src/hotspot/os/posix/signals_posix.cpp line 1587:
>>>
>>>> 1585:   // destructor has completed.
>>>> 1586:
>>>> 1587:   if (thread->is_Java_thread() && thread->as_Java_thread()->is_terminated()) {
>>>
>>> We need @dholmes-ora to verify that this version of the code will
>>> still solve the bug he was fixing when he added old L1610.
>>
>> Thanks @dcubed-ojdk - no it won't. The problem is that the signal can hit very late in a thread's termination process, after the JavaThread destructor has executed, so no query of JavaThread state is possible. So we needed something in the Thread state and the SR_lock was a good enough proxy for that. It may be possible to use a different Thread field (perhaps _ParkEvent but encapsulated in a Thread::has_terminated() helper method).
> 
> SR_handler is used for OS-level suspend/resume (not to conflict with this change-set).
> This feature is only used by JFR (AFAIK), and JFR only samples threads on it's ThreadsList.
> This means the JavaThread can never be terminated, hence this code will always pass.

I need to dig up the exact details but from memory the problem we were 
fixing was that the signal was seemingly not being delivered to the 
target thread until it modified its own signal mask sometime during 
thread termination. IIRC JFR may have a target thread on a ThreadList 
and try to suspend it, but the signal may not get delivered, 
consequently JFR will timeout and cancel the suspend request. After that 
the thread can continue on its way and may start to terminate, at which 
point the signal gets delivered and the SR_handler runs.

> If someone else is using OS-level suspend/resume without a ThreadsList, the bug is there is no ThreadsList AFAICT.
> 
> Since ThreadLocalStorage::thread() is cleared last in ~Thread with Thread::clear_thread_current(); may be in the ~Thread destructor.
> So the argument is that would be safe to read stuff from Thread but not JavaThread?

If the SR_handler gets a non-null value from 
Thread::current_or_null_safe() then we know that the signal got 
delivered before we called clear_current_thread() in the Thread 
destructor. We don't exactly where, but we know that it could be after 
the JavaThread destructor and so trying to treat the thread as a 
JavaThread is definitely not safe (hence the way the bug was 
discovered!). We deliberately nulled _SR_lock so that we could query it 
as a termination indicator and we do that while we still have a valid 
osThread. Hence if you see a non-NULL _SR_lock you know you can safely 
interact with the osthread.

> Since the compiler (and CPU) may reorder and optimize away code, so I argue reading from a half destroyed object is not a great idea.

The Thread object itself is not destroyed at this point, only objects 
referred to there-from.

> E.g. Monitor* _SR_lock; is not a volatile pointer; since reading from this memory is UB after destruction, compiler is free to remove or not publish the store to NULL.

It may well be free to do so but that wasn't what we found. Perhaps it 
should have been volatile with a storestore() for good measure.

> So I suggest either to remove this check, since the only user is using a ThreadsList and any other should also be using that too.
> Or call Thread::clear_thread_current() before the JavaThread destructor is called, that way we can be certain that there is no UB.

That would be very fragile IMO as it would require that no code 
executing from the destructors can require access to Thread:current() to 
work. It would certainly take some analysis to determine if it could be 
done that way. (Crash reporting would be affected.)

There is no UB in what we currently do AFAICS.

David
-----

> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/3191
> 


More information about the hotspot-runtime-dev mailing list