RFR(s): 8220774: Add HandshakeALot diag option

Daniel D. Daugherty daniel.daugherty at oracle.com
Wed Mar 20 20:21:20 UTC 2019


> And here is v2 for you to consider:
> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/ 

It's too difficult to craft my comments relative to the incremental
webrev so I'm working with the full webrev still.


src/hotspot/share/runtime/globals.hpp
     No comments.

src/hotspot/share/runtime/thread.hpp
     No comments.

src/hotspot/share/runtime/thread.cpp
     L2953:   // This checks that the thread have a correct frame state 
during a handshake
         typo: s/have/has/
         Also please add a '.' to end of the sentence.

     L2954:   assert((!has_last_Java_frame() && java_call_counter() == 0) ||
     L2955:          (has_last_Java_frame() && java_call_counter() > 0), 
"wrong java_sp info!");
         Trying to make sure I understand what you are asserting:

         1) if the thread does not have a last Java frame then it must have
            never called Java
         2) if the thread has a last Java frame, then it must have called
            Java at least once

         I'm good with the second expression. I have some vague doubts about
         first expression. When a JavaThread is done executing Java code and
         it is on its way toward a thread exit, is there ever a time when it
         no longer has a last Java frame? I'm thinking a handshake late in
         the JavaThread's life...

src/hotspot/share/runtime/vmThread.hpp
     No comments.

src/hotspot/share/runtime/vmThread.cpp
     L503:           MutexUnlockerEx mul(VMOperationQueue_lock, 
Mutex::_no_safepoint_check_flag);
         Please add something like this above L503:

                     // Have to unlock VMOperationQueue_lock just in case
                     // no_op_safepoint() has to do a handshake.

     old L619:     // We want to make sure that we get to a safepoint 
regularly.
     old L620:     //
     old L621:     if ((_cur_vm_operation = 
VMThread::no_op_safepoint(false)) != NULL) {
         This call to no_op_safepoint() is at the bottom of the 
"while(true)"
         loop in VMThread::loop(). Before the fix for 8219436, this line 
used to be:

             if (VMThread::no_op_safepoint_needed(true)) {

         and it was the only call to no_op_safepoint_needed() that passed
         'true'. It's the 'true' parameter that made the comment on L619
         correct. Why do I say that? Well, the only way we get to the
         bottom of the "while(true)" loop is if we had a vm_operation to
         perform. Let's say we had a bunch of no-safepoint vm_operations
         to perform and we just kept executing them, one after another.
         While doing this work, if our time between safepoints exceeds
         GuaranteedSafepointInterval, then this no_op_safepoint_needed(true)
         call is what would detect that we've gone too long between 
safepoints
         and would force one.

         With the fix for 8219436, that call was changed to 
no_op_safepoint(false)
         and with the v2 version of this fix that call is now gone. So we no
         longer have the ability to have GuaranteedSafepointInterval work
         right when we are doing lots of non-safepoint VM operations.

         This no_op_safepoint_needed() probe point, either with 'true' or
         'false', also gave us the opportunity to force a safepoint when
         SafepointALot is true or when 
SafepointSynchronize::is_cleanup_needed()
         returns true. In the v2 version of this fix, we lose the ability
         to SafepointALot after each VM operation. We also lose the ability
         to do cleanup() after each VM operation.

         I think you need to restore the "bool check_time" parameter to
         no_op_safepoint() and you need to restore this call to as:
         VMThread::no_op_safepoint_needed(true) along with the rest
         of old L618-L626.


     So here's the v2 version of no_op_safepoint():

       new L448: VM_Operation* VMThread::no_op_safepoint() {
       new L449:   // Do we need a real cleanup
       new L450:   if (SafepointSynchronize::is_cleanup_needed()) {
       new L451:     return &cleanup_op;
       new L452:   }
       new L453:   if (HandshakeALot) {
       new L454:     HandshakeALotTC haltc;
       new L455:     Handshake::execute(&haltc);
       new L456:   }
       new L457:   if (SafepointALot) {
       new L458:     return &safepointALot_op;
       new L459:   }
       new L460:   return NULL;
       new L461: }

     And here's the old version:

       old L437: VM_Operation* VMThread::no_op_safepoint(bool check_time) {
       old L438:   if (SafepointALot) {
       old L439:     return &safepointALot_op;
       old L440:   }
       old L441:   if (!SafepointSynchronize::is_cleanup_needed()) {
       old L442:     return NULL;
       old L443:   }
       old L444:   if (check_time) {
       old L445:     long interval_ms = 
SafepointTracing::time_since_last_safepoint_ms();
       old L446:     bool max_time_exceeded = 
GuaranteedSafepointInterval != 0 &&
       old L447:                              (interval_ms > 
GuaranteedSafepointInterval);
       old L448:     if (!max_time_exceeded) {
       old L449:       return NULL;
       old L450:     }
       old L451:   }
       old L452:   return &cleanup_op;
       old L453: }

     Ignoring the fact that 'check_time' is gone from v2 and I think you
     need to restore that (see above)...

     In the old code, SafepointALot had higher priority than 
is_cleanup_needed().
     In the new code, the order is: is_cleanup_needed(), HandshakeALot and
     SafepointALot. I think I'm okay with the change in priority order. 
If we
     have cleanup work that needs to be done, then having a cleanup 
safepoint is
     better info than SafepointALot or HandshakeALot.


     Something I just noticed: In the old code, if is_cleanup_needed() 
&& check_time
     && !max_time_exceeded, then we don't force a cleanup safepoint. 
This means that
     the probe at the bottom of the "while(true)" loop in 
VMThread::loop(), would
     only cause a cleanup safepoint when max_time_exceeded and not after 
every
     vm_operation.

     If you restore the 'check_time' parameter and restore the call to
     no_op_safepoint(true) at the bottom of the "while(true)" loop, then
     to restore the GuaranteedSafepointInterval check do it here:

       new L450:   if (SafepointSynchronize::is_cleanup_needed()) {
                     if (!check_time) {
       new L451:       return &cleanup_op;
                     }
                     long interval_ms = 
SafepointTracing::time_since_last_safepoint_ms();
                     bool max_time_exceeded = 
GuaranteedSafepointInterval != 0 &&
                                              (interval_ms > 
GuaranteedSafepointInterval);
                     if (max_time_exceeded) {
                       return &cleanup_op;
                     }
       new L452:   }

     And that will give you the same semantics for the no_op_safepoint(true)
     call at the bottom of the "while(true)" loop.

I think that's it.

Dan


On 3/20/19 11:55 AM, Robbin Ehn wrote:
> Hi Dan, thanks for having a look!
>
>> src/hotspot/share/runtime/thread.hpp
>>      L1871:   // Debug method asserting thread states
>>      L1872:   DEBUG_ONLY(void assert_states();)
>>          The comment doesn't add anything beyond what you can see with
>>          the line of code (except for the "thread" part). I think you
>>          need a better function name and a better comment.
>
> My thinking was that anyone can add whatever should be true for a 
> JavaThread during a handshake.
> Fixed.
>
>>
>> src/hotspot/share/runtime/thread.cpp
>>      L2952: void JavaThread::assert_states() {
>>      L2953:   assert((!has_last_Java_frame() && java_call_counter() 
>> == 0) ||
>>      L2954:          (has_last_Java_frame() && java_call_counter() > 
>> 0), "wrong java_sp info!");
>>          The assert failure string doesn't really match the assertion
>>          unless I missing something here...
>
> The last java frame is populated from the stack pointer.
> The assert is copy n paste from JavaThread::nmethods_do.
>
>>
>>          Also what's the reason for this assertion?
>
> I saw a issue on windows with this. I have not been able to reproduce it.
>
>>
>> src/hotspot/share/runtime/vmThread.hpp
>>      No comments.
>>
>> src/hotspot/share/runtime/vmThread.cpp
>>      L437: class HandshakeALotTC : public ThreadClosure {
>>      <snip>
>>      L439:   virtual void do_thread(Thread* thread) {
>>      <snip>
>>      L441:     JavaThread* jt = (JavaThread*)thread;
>>      L442:     DEBUG_ONLY(jt->assert_states();)
>>          Okay, but why are these thread states important here?
>
> As I said above, we should add more here as we continue.
>
>>
>>          Also L441 could also be DEBUG_ONLY. In fact the whole function
>>          body could be just #ifdef ASSERT ... #endif.
>
> Fixed.
>
>>
>>      old L437: VM_Operation* VMThread::no_op_safepoint(bool 
>> check_time) {
>>      <snip>
>>      old L444:   if (check_time) {
>>      old L445:     long interval_ms = 
>> SafepointTracing::time_since_last_safepoint_ms();
>>      old L446:     bool max_time_exceeded = 
>> GuaranteedSafepointInterval != 0 &&
>>      old L447:                              (interval_ms > 
>> GuaranteedSafepointInterval);
>>      old L448:     if (!max_time_exceeded) {
>>      old L449:       return NULL;
>>      old L450:     }
>>      old L451:   }
>>          In the existing code base, all callers to no_op_safepoint()
>>          passed 'false' so L444-L451 was dead code. It looks like
>>          before this changeset, we had both 'true' and 'false' callers:
>>
>>          changeset:   53895:b22d8ae270a2
>>          user:        rehn
>>          date:        Fri Feb 22 14:20:06 2019 +0100
>>          summary:     8219436: Safepoint logs correction and misc
>>
>>      new L446: VM_Operation* VMThread::no_op_safepoint() {
>>      new L447:   long interval_ms = 
>> SafepointTracing::time_since_last_safepoint_ms();
>>      new L448:   bool max_time_exceeded = GuaranteedSafepointInterval 
>> != 0 &&
>>      new L449:                            (interval_ms >= 
>> GuaranteedSafepointInterval);
>>      new L450:   if (!max_time_exceeded) {
>>      new L451:     return NULL;
>>      new L452:   }
>>          In the new code, L447-L452 are no longer dead and in fact the
>>          block is now first and is always checked. This means that if
>>          max_time_exceeded == false, we return NULL and there's no
>>          safepoint.
>
> In the first call the timeout from the queue will always be greater, 
> so there was no point in avoiding that check. (simpler code I thought)
>
>>
>>          This is a change in behavior for the SafepointALot flag since
>>          this code block used to be first:
>>
>>          old L438:   if (SafepointALot) {
>>          old L439:     return &safepointALot_op;
>>          old L440:   }
>>
>>          So unless I'm missing something SafepointALot now appears to
>>          be broken since it will only happen if max_time_exceeded == 
>> true
>>          and that's just a normal safepoint interval.
>
> Not really because of the:
> "We want to make sure that we get to a safepoint regularly."
> Is only executed after a real VM operation.
>
> And I needed to move it so you can run with both safepoint a lot and 
> handshake a lot and get both.
>
>
>>
>>      old L494:         if (timedout && (_cur_vm_operation = 
>> VMThread::no_op_safepoint(false)) != NULL) {
>>      old L495:           MutexUnlockerEx mul(VMOperationQueue_lock,
>>      old L496: Mutex::_no_safepoint_check_flag);
>>      new L506:           MutexUnlockerEx mul(VMOperationQueue_lock, 
>> Mutex::_no_safepoint_check_flag);
>>      new L507:           if (timedout && (_cur_vm_operation = 
>> VMThread::no_op_safepoint()) != NULL) {
>>          The old code only unlocked (and relocked) VMOperationQueue_lock
>>          when we knew that we needed to force a safepoint. Now it is 
>> always
>>          unlocked (and relocked) on this code path.
>
> Yes since if we need a handshake it must be unlocked.
> VMOperationQueue_lock is not a contended lock, unlocking and relocking 
> it every GuaranteedSafepointInterval is not an issue.
>
>>
>>      old L621:     if ((_cur_vm_operation = 
>> VMThread::no_op_safepoint(false)) != NULL) {
>>      new L633:     if ((_cur_vm_operation = 
>> VMThread::no_op_safepoint()) != NULL) {
>>          Before the fix for 8219436, this line used to be:
>>
>>              if (VMThread::no_op_safepoint_needed(true)) {
>>
>>          so the fix for 8219436 is what made old L444-L451 into
>>          dead code.
>>
>>
>> Okay, so I think that this fix breaks the SafepointALot flag.
>>
>> The fix for 8219436 broke the max_time_exceeded logic and
>> this fix solves that, but it breaks SafepointALot at the same
>> time so I think you need to rework things a bit.
>
> GuaranteedSafepointInterval only calls VMThread::no_op_safepoint, if 
> there is no cleanup needed we don't safepoint because it hurts 
> performance.
>
> If we have set SafepointALot we do safepoint on 
> GuaranteedSafepointInterval.
> (ignoring that we sometime end up in dummy code for safepointing 
> 'regulary', I'll just remove that in v2)
>
> java -XX:+UnlockDiagnosticVMOptions 
> -XX:GuaranteedSafepointInterval=100 -XX:+SafepointALot -cp . Dummy
>
> jdk/jdk:
> [1.157s][info][safepoint] Safepoint "None", Time since last: 100127512 
> ns, Reaching safepoint: 147044 ns, At safepoint: 32355 ns, Total: 
> 179399 ns
> [1.258s][info][safepoint] Safepoint "None", Time since last: 100141946 
> ns, Reaching safepoint: 190741 ns, At safepoint: 65858 ns, Total: 
> 256599 ns
> [1.358s][info][safepoint] Safepoint "None", Time since last: 100368088 
> ns, Reaching safepoint: 184854 ns, At safepoint: 51552 ns, Total: 
> 236406 ns
> [1.459s][info][safepoint] Safepoint "None", Time since last: 100186239 
> ns, Reaching safepoint: 196150 ns, At safepoint: 40827 ns, Total: 
> 236977 ns
>
> With my v1 patch:
> [2.567s][info][safepoint] Safepoint "None", Time since last: 100112625 
> ns, Reaching safepoint: 6360001 ns, At safepoint: 104928 ns, Total: 
> 6464929 ns
> [2.668s][info][safepoint] Safepoint "None", Time since last: 100117756 
> ns, Reaching safepoint: 1671472 ns, At safepoint: 89107 ns, Total: 
> 1760579 ns
> [2.770s][info][safepoint] Safepoint "None", Time since last: 100111951 
> ns, Reaching safepoint: 1849522 ns, At safepoint: 96310 ns, Total: 
> 1945832 ns
> [2.873s][info][safepoint] Safepoint "None", Time since last: 100107481 
> ns, Reaching safepoint: 1834603 ns, At safepoint: 86498 ns, Total: 
> 1921101 ns
>
> Since we do a timed wait on queue with GuaranteedSafepointInterval 
> this is what we get.
>
> The "safepoint regularly" case just causes an extra safepoint directly 
> after a safepoint with an real operation:
> [0.313s][info][safepoint] Safepoint "RevokeBias", Time since last: 
> 64643 ns, Reaching safepoint: 697102 ns, At safepoint: 76344 ns, 
> Total: 773446 ns
> [0.313s][info][safepoint] Safepoint "None", Time since last: 38694 ns, 
> Reaching safepoint: 143387 ns, At safepoint: 29936 ns, Total: 173323 ns
>
> Which is the only case this changes behavior for SafepointALot.
> Which I'll remove for v2, if you really like it, it should be in debug 
> builds only. Since the whole "safepoint regularly" is very false 
> today, we don't want to safepoint :)
>
> And here is v2 for you to consider:
> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
> Inc:
> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/inc/webrev/
>
> Sanity with t1-2.
>
> Thanks, Robbin
>
>>
>> Dan
>>
>>
>>> Issue:
>>> https://bugs.openjdk.java.net/browse/JDK-8220774
>>>
>>> We need an option to generate a lot of handshakes for testing.
>>> This adds -XX:+HandshakeALot, typical usage:
>>> -XX:+UnlockDiagnosticVMOptions -XX:GuaranteedSafepointInterval=200 
>>> -XX:+HandshakeALot
>>> Will generate a extra handshake if there is no safepoint/handshake 
>>> per 200ms.
>>>
>>> Thanks, Robbin
>>>
>>



More information about the hotspot-runtime-dev mailing list