RFR(s): 8218147: make_walkable asserts on multiple calls

David Holmes david.holmes at oracle.com
Tue Apr 16 07:27:19 UTC 2019


On 16/04/2019 4:51 pm, Robbin Ehn wrote:
> Hi David,
> 
> On 4/16/19 4:00 AM, David Holmes wrote:
>> Hi Robbin,
>>
>> On 15/04/2019 6:58 pm, Robbin Ehn wrote:
>>> Hi, please review.
>>>
>>> After reexamine this issue:
>>> Threads in native must always have their stack walkable.
>>> JFR sampler should never need to make a stack walkable (for native 
>>> sample).
>>>
>>> I manage to locally reproduce reliable with changes to JFR sampler 
>>> and having
>>> hundreds of threads running similar code as the in the bug.
>>> (Looping creating an array with negative size.)
>>>
>>> I found a place where we don't proper look at the suspend flags.
>>> The java thread can thus escape native and make it's stack unwalkable 
>>> and later
>>> it tries to make it walkable at the same time as the JFR sampler.
>>>
>>> By removing some kind of fast check and instead always call the 
>>> check_safepoint_and_suspend_for_native_trans I can no longer reproduce.
>>
>> Sorry but I can't see how this can fix anything:
>>
>> -     if (SafepointMechanism::should_block(thread) || 
>> thread->is_suspend_after_native()) {
>>          
>> JavaThread::check_safepoint_and_suspend_for_native_trans(thread);
>> -     }
>>
> 
> In check_safepoint_and_suspend_for_native_trans we check _trace_flag and 
> stop with macro:
> JFR_ONLY(SUSPEND_THREAD_CONDITIONAL(thread);)

Ah I see - needed to expand that macro.

> This method does not do the right thing:
> bool is_suspend_after_native() const {
>    return (_suspend_flags & (_external_suspend | _deopt_suspend)) != 0;
> }
> 
> If we want to keep the out-of-line double checking, this is an alternative:
> diff -r 51211b2d6514 src/hotspot/share/runtime/thread.hpp
> --- a/src/hotspot/share/runtime/thread.hpp    Tue Apr 16 08:38:32 2019 
> +0200
> +++ b/src/hotspot/share/runtime/thread.hpp    Tue Apr 16 08:44:21 2019 
> +0200
> @@ -1417,3 +1417,3 @@
>     bool is_suspend_after_native() const {
> -    return (_suspend_flags & (_external_suspend | _deopt_suspend)) != 0;
> +    return (_suspend_flags & (_external_suspend | _deopt_suspend | 
> _trace_flag)) != 0;
>     }

Right. Should that use JFR_ONLY?

I was a little concerned that thread->set_trace_flag() may not ensure 
visibility of the flag update, but then realized it should be covered by 
the fence:

  static inline void transition_from_native(JavaThread *thread, 
JavaThreadState to) {
     // Change to transition state and ensure it is seen by the VM thread.
     thread->set_thread_state_fence(_thread_in_native_trans);

and the comment should say:

// Change to transition state and ensure it is seen by other thread,
// and we will see any _suspend_flag changes below.

However it also seems to me that in 
JfrThreadSampleClosure::do_sample_thread we need a storeload() barrier 
after:

   thread->set_trace_flag();

to ensure its not reordered with the reads of thread->thread_state() ?

Thanks,
David
-----


> So we are missing checking that bit completely in this transition code.
> 
> Thanks, Robbin
> 
> 
>> All you are doing is changing the timing of the race between the 
>> thread re-entering the VM/Java and the request for a suspend or 
>> safepoint.
>>
>> If there is a race between the sampler logic acting on the thread, and 
>> the thread acting on itself then that race has to be precluded somehow.
>>
>> Thanks,
>> David
>> -----
>>
>>> (which have the JFR native trans suspend check)
>>> And it passes t1-5.
>>>
>>> Code:
>>> http://cr.openjdk.java.net/~rehn/8218147/v3/webrev/
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8218147
>>>
>>> Thanks, Robbin
>>>
>>> On 4/5/19 5:43 PM, Robbin Ehn wrote:
>>>> Hi Dean,
>>>>
>>>> Sorry, I missed this mail.
>>>> Yes we can do that.
>>>> Ignore my other mail, I'll update.
>>>>
>>>> Thanks, Robbin
>>>>
>>>>
>>>> dean.long at oracle.com skrev: (5 april 2019 09:22:24 CEST)
>>>>> On 4/4/19 5:16 PM, dean.long at oracle.com wrote:
>>>>>>
>>>>>>>>
>>>>>>>> If it's already set, should we check that _last_Java_pc matches the
>>>>>
>>>>>>>> new value?
>>>>>>>
>>>>>>> We manually set the pc in several places, so if it's set, it's not
>>>>>>> certain that
>>>>>>> it should be the same as in last sp.
>>>>>>> I can't distinguish between the cases.
>>>>>>>
>>>>>>
>>>>>> If we get pc from sp[-1] then it should match, but you're right, we
>>>>>> sometimes get pc from somewhere else.
>>>>>
>>>>> How about if we combine the !walkable check and the
>>>>> capture_last_Java_pc() logic into a single method?
>>>>> Then we can do something like:
>>>>>
>>>>>      if (!walkable()) {
>>>>>          address pc = (address)_last_Java_sp[-1];
>>>>>          address a = Atomic::cmpxchg(pc, &_last_Java_pc, NULL);
>>>>>          assert(a == NULL || a == pc, "unexpected PC %p", a);
>>>>>      }
>>>>>
>>>>> dl


More information about the hotspot-jfr-dev mailing list