RFR(s): 8220774: Add HandshakeALot diag option

Thu Mar 21 22:01:45 UTC 2019

I will review now that the dust has settled :)

Might be a few hours.

David

On 22/03/2019 2:04 am, Robbin Ehn wrote:
> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>> Check-out v3, just full:
>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/ 
>>
>> src/hotspot/share/runtime/globals.hpp
>>      No comments.
>>
>> src/hotspot/share/runtime/thread.hpp
>>      No comments.
>>
>> src/hotspot/share/runtime/thread.cpp
>>      L2954:   assert((!has_last_Java_frame() && java_call_counter() == 
>> 0) ||
>>      L2955:          (has_last_Java_frame() && java_call_counter() > 
>> 0), "wrong java_sp info!");
>>         Perhaps change to this:
>>
>>               assert((!has_last_Java_frame() && java_call_counter() == 
>> 0) ||
>>                      (has_last_Java_frame() && java_call_counter() > 0),
>>                      "unexpected frame info: has_last_frame=%d, 
>> java_call_counter=%d",
>>                      has_last_Java_frame(), java_call_counter());
>>
>>          and consider making the same change to the nmethod version.
>>
> 
> Fixed both!
> 
>> src/hotspot/share/runtime/vmThread.hpp
>>      No comments.
>>
>> src/hotspot/share/runtime/vmThread.cpp
>>      L449:   // Must check for handshakes first, since ops returns.
>>          Perhaps:
>>              // Check for handshakes first since we may need to return 
>> a VMop.
>>
>>      L454:   // Check for a true cleanup first, trying to keep stats 
>> correct.
>>          Perhaps:
>>              // Check for a cleanup before SafepointALot to keep stats 
>> correct.
>>
>>      L635:     // We want to make sure that we get to a safepoint 
>> regularly.
>>          Perhaps an addition:
>>                // We want to make sure that we get to a safepoint 
>> regularly
>>                // even when executing VMops that don't require 
>> safepoints.
>>
> 
> Fixed above!
> 
> Publishing v4, since I need a second review:
> Full:
> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
> Inc:
> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
> 
> Thanks, Robbin
> 
>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>> call at the bottom of the "while(true)" in VMThread::loop().
>>
>> If you decide to tweak the semantics for SafepointALot down the road,
>> it would be best to do that in its own bug rather than as part of
>> another bug.
>>
>> Thumbs up! Your call on whether to tweak the assert or change the 
>> comments.
>>
>> Dan
>>
>>
>>
>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>> Hi Dan,
>>>>
>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>> And here is v2 for you to consider:
>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/ 
>>>>>
>>>>> It's too difficult to craft my comments relative to the incremental
>>>>> webrev so I'm working with the full webrev still.
>>>>>
>>>>>
>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>      No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>      No comments.
>>>>>
>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>      L2953:   // This checks that the thread have a correct frame 
>>>>> state during a handshake
>>>>>          typo: s/have/has/
>>>>>          Also please add a '.' to end of the sentence.
>>>>
>>>> Fixed.
>>>>
>>>>>
>>>>>      L2954:   assert((!has_last_Java_frame() && java_call_counter() 
>>>>> == 0) ||
>>>>>      L2955:          (has_last_Java_frame() && java_call_counter() 
>>>>> > 0), "wrong java_sp info!");
>>>>>          Trying to make sure I understand what you are asserting:
>>>>>
>>>>>          1) if the thread does not have a last Java frame then it 
>>>>> must have
>>>>>             never called Java
>>>>>          2) if the thread has a last Java frame, then it must have 
>>>>> called
>>>>>             Java at least once
>>>>>
>>>>>          I'm good with the second expression. I have some vague 
>>>>> doubts about
>>>>>          first expression. When a JavaThread is done executing Java 
>>>>> code and
>>>>>          it is on its way toward a thread exit, is there ever a 
>>>>> time when it
>>>>>          no longer has a last Java frame? I'm thinking a handshake 
>>>>> late in
>>>>>          the JavaThread's life...
>>>>
>>>> The assert is a copy paste from
>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>
>>>> I have seen this assert trigger twice on windows, either it's wrong 
>>>> and should
>>>> be removed from nmethods_do, or it is correct and this will help us 
>>>> find when
>>>> this happens.
>>>
>>> Discussed this with Robbin via chat. I'm good with leaving
>>> the assert in place. Perhaps tweak both like this:
>>>
>>>     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>            (has_last_Java_frame() && java_call_counter() > 0),
>>>            "unexpected frame info: has_last_frame=%d, 
>>> java_call_counter=%d",
>>>            has_last_Java_frame(), java_call_counter());
>>>
>>>>
>>>>>
>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>      No comments.
>>>>>
>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>      L503:           MutexUnlockerEx mul(VMOperationQueue_lock, 
>>>>> Mutex::_no_safepoint_check_flag);
>>>>>          Please add something like this above L503:
>>>>>
>>>>>                      // Have to unlock VMOperationQueue_lock just 
>>>>> in case
>>>>>                      // no_op_safepoint() has to do a handshake.
>>>>>
>>>>>      old L619:     // We want to make sure that we get to a 
>>>>> safepoint regularly.
>>>>>      old L620:     //
>>>>>      old L621:     if ((_cur_vm_operation = 
>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>          This call to no_op_safepoint() is at the bottom of the 
>>>>> "while(true)"
>>>>>          loop in VMThread::loop(). Before the fix for 8219436, this 
>>>>> line used to be:
>>>>>
>>>>>              if (VMThread::no_op_safepoint_needed(true)) {
>>>>>
>>>>>          and it was the only call to no_op_safepoint_needed() that 
>>>>> passed
>>>>>          'true'. It's the 'true' parameter that made the comment on 
>>>>> L619
>>>>>          correct. Why do I say that? Well, the only way we get to the
>>>>>          bottom of the "while(true)" loop is if we had a 
>>>>> vm_operation to
>>>>>          perform. Let's say we had a bunch of no-safepoint 
>>>>> vm_operations
>>>>>          to perform and we just kept executing them, one after 
>>>>> another.
>>>>>          While doing this work, if our time between safepoints exceeds
>>>>>          GuaranteedSafepointInterval, then this 
>>>>> no_op_safepoint_needed(true)
>>>>>          call is what would detect that we've gone too long between 
>>>>> safepoints
>>>>>          and would force one.
>>>>>
>>>>>          With the fix for 8219436, that call was changed to 
>>>>> no_op_safepoint(false)
>>>>>          and with the v2 version of this fix that call is now gone. 
>>>>> So we no
>>>>>          longer have the ability to have 
>>>>> GuaranteedSafepointInterval work
>>>>>          right when we are doing lots of non-safepoint VM operations.
>>>>>
>>>>>          This no_op_safepoint_needed() probe point, either with 
>>>>> 'true' or
>>>>>          'false', also gave us the opportunity to force a safepoint 
>>>>> when
>>>>>          SafepointALot is true or when 
>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>          returns true. In the v2 version of this fix, we lose the 
>>>>> ability
>>>>>          to SafepointALot after each VM operation. We also lose the 
>>>>> ability
>>>>>          to do cleanup() after each VM operation.
>>>>
>>>> And I don't think that is an issue, ICache and Monitor, which is the 
>>>> cleanup we
>>>> do, just have just basic heuristic. If we cleanup every second or 
>>>> every other
>>>> second makes little difference.
>>>
>>> My point is that you are changing the semantics of SafepointALot without
>>> stating clearly that is what you are doing. The point of 
>>> SafepointALot is
>>> to inject a safepoint after every VMop even those VMops that have to be
>>> executed at a safepoint. SafepointALot is considered a stress option for
>>> a very good reason.
>>>
>>>
>>>> SafepointALot is just diag, which causes safepoint directly after a 
>>>> safepoint.
>>>> I'd rather have it doing one extra safepoint per 
>>>> GuaranteedSafepointInterval,
>>>> that way you know how many safepoint you have per second, which is 
>>>> really useful
>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>
>>> You would never use SafepointALot in a benchmark run. It is is
>>> stress option.
>>>
>>>>
>>>>>
>>>>>          I think you need to restore the "bool check_time" 
>>>>> parameter to
>>>>>          no_op_safepoint() and you need to restore this call to as:
>>>>>          VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>          of old L618-L626.
>>>>
>>>> I don't need the parameter, we can always check the time.
>>>> When we timeout on queue wait we will always pass the time check.
>>>
>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>> a time check.
>>>
>>>
>>>> So the difference is that I think the previous behavior was 
>>>> bad/buggy and you are trying to restore it :)
>>>>
>>>> Check-out v3, just full:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>
>>> I'll post a second message with my re-review for this webrev.
>>>
>>>
>>>>
>>>> - wait on queue GuaranteedSafepointInterval
>>>> - if timeout
>>>>     - do handshake a lot
>>>>     - we are guaranteed to pass the time check, so cleanup if needed
>>>>     - else safepoint alot
>>>> - if op
>>>>     - after op
>>>>         - do handshake a lot
>>>>         - check time, if expired, cleanup if needed
>>>>         - else safepoint alot
>>>>
>>>> I think that is what you want :)
>>>> And that's what in v3.
>>>>
>>>> I would rather have:
>>>> - wait-time = GuaranteedSafepointInterval - 
>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>> - if wait-time > 0
>>>>     - wait on queue
>>>> - else or timeout
>>>>     - do handshake a lot
>>>>     - cleanup if needed
>>>>     - else safepoint alot
>>>>
>>>> That way we do check for cleanup after non-safepoint in time but we 
>>>> do not
>>>> causes extra safepoints after safepoints with SafepointALot. (I know 
>>>> you think of this as feature :) )
>>>
>>> Not so much a feature as a stress option. I love my stress options.
>>>
>>> Dan
>>>
>>>>
>>>> Thanks, Robbin
>>>
>>