RFR(s): 8220774: Add HandshakeALot diag option
Daniel D. Daugherty
daniel.daugherty at oracle.com
Thu Mar 21 15:11:02 UTC 2019
> Check-out v3, just full:
> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
src/hotspot/share/runtime/globals.hpp
No comments.
src/hotspot/share/runtime/thread.hpp
No comments.
src/hotspot/share/runtime/thread.cpp
L2954: assert((!has_last_Java_frame() && java_call_counter() == 0) ||
L2955:        (has_last_Java_frame() && java_call_counter() > 0),
              "wrong java_sp info!");
Perhaps change to this:

    assert((!has_last_Java_frame() && java_call_counter() == 0) ||
           (has_last_Java_frame() && java_call_counter() > 0),
           "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
           has_last_Java_frame(), java_call_counter());
and consider making the same change to the nmethod version.
src/hotspot/share/runtime/vmThread.hpp
No comments.
src/hotspot/share/runtime/vmThread.cpp
L449: // Must check for handshakes first, since ops returns.
Perhaps:
// Check for handshakes first since we may need to return a VMop.
L454: // Check for a true cleanup first, trying to keep stats correct.
Perhaps:
// Check for a cleanup before SafepointALot to keep stats correct.
L635: // We want to make sure that we get to a safepoint regularly.
Perhaps an addition:
// We want to make sure that we get to a safepoint regularly
// even when executing VMops that don't require safepoints.
Okay, I now grok why VMThread::no_op_safepoint() does not need a
'check_time' parameter. Thanks for restoring the no_op_safepoint()
call at the bottom of the "while(true)" in VMThread::loop().
If you decide to tweak the semantics for SafepointALot down the road,
it would be best to do that in its own bug rather than as part of
another bug.
Thumbs up! Your call on whether to tweak the assert or change the comments.
Dan
On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>> Hi Dan,
>>
>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>> And here is v2 for you to consider:
>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>
>>> It's too difficult to craft my comments relative to the incremental
>>> webrev so I'm working with the full webrev still.
>>>
>>>
>>> src/hotspot/share/runtime/globals.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/thread.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>> L2953: // This checks that the thread have a correct frame
>>> state during a handshake
>>> typo: s/have/has/
>>> Also please add a '.' to end of the sentence.
>>
>> Fixed.
>>
>>>
>>> L2954: assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>> L2955:        (has_last_Java_frame() && java_call_counter() > 0), "wrong java_sp info!");
>>> Trying to make sure I understand what you are asserting:
>>>
>>> 1) if the thread does not have a last Java frame, then it must have
>>>    never called Java
>>> 2) if the thread has a last Java frame, then it must have called
>>>    Java at least once
>>>
>>> I'm good with the second expression. I have some vague doubts about
>>> the first expression. When a JavaThread is done executing Java code
>>> and it is on its way toward a thread exit, is there ever a time when
>>> it no longer has a last Java frame? I'm thinking of a handshake late
>>> in the JavaThread's life...
>>
>> The assert is a copy-paste from
>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>
>> I have seen this assert trigger twice on Windows. Either it's wrong
>> and should be removed from nmethods_do, or it is correct and this
>> will help us find out when this happens.
>
> Discussed this with Robbin via chat. I'm good with leaving
> the assert in place. Perhaps tweak both like this:
>
> assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>        (has_last_Java_frame() && java_call_counter() > 0),
>        "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>        has_last_Java_frame(), java_call_counter());
>
>>
>>>
>>> src/hotspot/share/runtime/vmThread.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/vmThread.cpp
>>> L503: MutexUnlockerEx mul(VMOperationQueue_lock, Mutex::_no_safepoint_check_flag);
>>> Please add something like this above L503:
>>>
>>> // Have to unlock VMOperationQueue_lock just in case
>>> // no_op_safepoint() has to do a handshake.
>>>
>>> old L619: // We want to make sure that we get to a safepoint regularly.
>>> old L620: //
>>> old L621: if ((_cur_vm_operation = VMThread::no_op_safepoint(false)) != NULL) {
>>> This call to no_op_safepoint() is at the bottom of the "while(true)"
>>> loop in VMThread::loop(). Before the fix for 8219436, this line
>>> used to be:
>>>
>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>
>>> and it was the only call to no_op_safepoint_needed() that passed
>>> 'true'. It's the 'true' parameter that made the comment on L619
>>> correct. Why do I say that? Well, the only way we get to the bottom
>>> of the "while(true)" loop is if we had a vm_operation to perform.
>>> Let's say we had a bunch of no-safepoint vm_operations to perform
>>> and we just kept executing them, one after another. While doing
>>> this work, if our time between safepoints exceeds
>>> GuaranteedSafepointInterval, then this no_op_safepoint_needed(true)
>>> call is what would detect that we've gone too long between
>>> safepoints and would force one.
>>>
>>> With the fix for 8219436, that call was changed to
>>> no_op_safepoint(false) and with the v2 version of this fix that
>>> call is now gone. So we no longer have the ability to have
>>> GuaranteedSafepointInterval work right when we are doing lots of
>>> non-safepoint VM operations.
>>>
>>> This no_op_safepoint_needed() probe point, either with 'true' or
>>> 'false', also gave us the opportunity to force a safepoint when
>>> SafepointALot is true or when
>>> SafepointSynchronize::is_cleanup_needed() returns true. In the v2
>>> version of this fix, we lose the ability to SafepointALot after
>>> each VM operation. We also lose the ability to do cleanup() after
>>> each VM operation.
>>
>> And I don't think that is an issue. ICache and Monitor, which are
>> the cleanup we do, just have basic heuristics. Whether we clean up
>> every second or every other second makes little difference.
>
> My point is that you are changing the semantics of SafepointALot without
> stating clearly that is what you are doing. The point of SafepointALot is
> to inject a safepoint after every VMop even those VMops that have to be
> executed at a safepoint. SafepointALot is considered a stress option for
> a very good reason.
>
>
>> SafepointALot is just a diagnostic option, and it causes a safepoint
>> directly after a safepoint. I'd rather have it do one extra
>> safepoint per GuaranteedSafepointInterval; that way you know how
>> many safepoints you have per second, which is really useful for
>> benchmarking. In its current form it can't be used for benchmarks.
>
> You would never use SafepointALot in a benchmark run. It is a
> stress option.
>
>>
>>>
>>> I think you need to restore the "bool check_time" parameter to
>>> no_op_safepoint() and you need to restore this call as:
>>> VMThread::no_op_safepoint_needed(true), along with the rest of
>>> old L618-L626.
>>
>> I don't need the parameter, we can always check the time.
>> When we timeout on queue wait we will always pass the time check.
>
> As I pointed out, if you are doing non-safepoint VM ops, you can miss
> a time check.
>
>
>> So the difference is that I think the previous behavior was bad/buggy
>> and you are trying to restore it :)
>>
>> Check-out v3, just full:
>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>
> I'll post a second message with my re-review for this webrev.
>
>
>>
>> - wait on queue GuaranteedSafepointInterval
>> - if timeout
>> - do handshake a lot
>> - we are guaranteed to pass the time check, so cleanup if needed
>> - else safepoint alot
>> - if op
>> - after op
>> - do handshake a lot
>> - check time, if expired, cleanup if needed
>> - else safepoint alot
>>
>> I think that is what you want :)
>> And that's what's in v3.
>>
>> I would rather have:
>> - wait-time = GuaranteedSafepointInterval - SafepointTracing::time_since_last_safepoint_ms()
>> - if wait-time > 0
>> - wait on queue
>> - else or timeout
>> - do handshake a lot
>> - cleanup if needed
>> - else safepoint alot
>>
>> That way we do check for cleanup after non-safepoint ops in time,
>> but we do not cause extra safepoints after safepoints with
>> SafepointALot. (I know you think of this as a feature :) )
>
> Not so much a feature as a stress option. I love my stress options.
>
> Dan
>
>>
>> Thanks, Robbin
>
More information about the hotspot-runtime-dev mailing list