RFR(s): 8220774: Add HandshakeALot diag option
David Holmes
david.holmes at oracle.com
Mon Mar 25 01:36:54 UTC 2019
Hi Robbin,
On 23/03/2019 1:29 am, Robbin Ehn wrote:
> Hi David,
>
> On 3/22/19 8:05 AM, David Holmes wrote:
>> Hi Robbin,
>>
>> This was a little more complex than I had imagined. :) A couple of comments:
>>
>> src/hotspot/share/runtime/vmThread.cpp
>>
>> 508   // Have to unlock VMOperationQueue_lock just in case no_op_safepoint()
>> 509   // has to do a handshake.
>> 510   MutexUnlockerEx mul(VMOperationQueue_lock, Mutex::_no_safepoint_check_flag);
>> 511   if (timedout && (_cur_vm_operation = VMThread::no_op_safepoint()) != NULL) {
>>
>> Wouldn't it be better to check timedout first, and only then use the unlocker
>> and check _cur_vm_operation?
>
> Fixed.
>
>>
>> ---
>>
>> src/hotspot/share/runtime/thread.cpp
>>
>> Can has_last_Java_frame() and java_call_counter() change values between their
>> use in the assert and their use in the assert message?
>
> No, the JavaThread is safepoint_safe here, thus it must have a stable Java stack.
Ok.
>>
>> ---
>>
>> Don't you want to add some tests that exercise this? Or update existing tests
>> that use SafepointALot to also use HandshakeALot?
>
> Fixed!
I'm unclear on how this fix and the renamed test relate to the existing bug
JDK-8214174 that was referenced in the ProblemList entry?
test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
If you need the loop index, use the full form of the for-loop. This:
43     int i = 0;
44     for (Thread thr : _suspend_threads) {
45         if (i++ > _suspend_threads.length - 2) {
46             // Leave last 2 threads running.
47             break;
48         }
49         if (Thread.currentThread() != thr) {
can reduce to simply:
    // Leave last 2 threads running.
    for (int i = 0; i < _suspend_threads.length - 2; i++) {
        if (Thread.currentThread() != _suspend_threads[i]) {
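Incidentally, the indexed form also matches the comment better: the post-increment check only breaks once i exceeds length - 2, so the original visits one element more than intended. A minimal self-contained sketch (the element count 5 and the counting helpers are made up for illustration; counting stands in for the suspend calls):

```java
public class LoopFormDemo {
    // Mimics the original form: post-increment check with break.
    static int visitedOriginal(int length) {
        int visited = 0;
        int i = 0;
        for (int k = 0; k < length; k++) {
            if (i++ > length - 2) {
                break; // intended to leave the last two elements
            }
            visited++;
        }
        return visited;
    }

    // Mimics the suggested form: explicit index bound.
    static int visitedSuggested(int length) {
        int visited = 0;
        for (int i = 0; i < length - 2; i++) {
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(visitedOriginal(5));  // prints 4: only one element is left
        System.out.println(visitedSuggested(5)); // prints 3: the last two are left
    }
}
```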
67 // Wait for all suspend thread starting to loop.
-> // Wait for all suspend-threads to start looping.
75     exit_threads[i] = new Thread(new Runnable() { public void run() {} });
The "new Runnable()..." is unnecessary - "new Thread();" suffices.
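For reference, Thread's default run() only delegates to a Runnable target and does nothing when none was supplied, so a bare new Thread() terminates as soon as it is started. A minimal sketch:

```java
public class EmptyThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // No Runnable target: the default run() is effectively a no-op.
        Thread t = new Thread();
        t.start();
        t.join();                       // returns almost immediately
        System.out.println(t.isAlive()); // prints false: already terminated
    }
}
```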
79     // Try to suspend them.
80     for (Thread thr : exit_threads) {
81         thr.suspend();
82     }
83     for (Thread thr : exit_threads) {
84         thr.resume();
85     }
There's really no guarantee of getting the suspend during "exit". The
SuspendAtExit test seems to use logging to make it more likely to
encounter the suspend when desired. Is there a reason not to add a
second @run to that test to use handshakes?
90     do {
91         for (Thread thr : _suspend_threads) {
92             thr.resume();
93         }
94         while (_sem.tryAcquire()) {
95             --waiting;
96         }
97     } while (waiting > 0);
Why do a busy-wait and resume threads you never suspended? Why not just
do a blocking acquire() on the semaphore? The main suspend/resume logic
on the _suspend_threads must leave the target thread resumed.
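Something like the following sketch is what I have in mind (the semaphore and the count of stand-in threads are placeholders for the test's _sem and _suspend_threads; each stand-in releases once, and the main thread just blocks on acquire() that many times):

```java
import java.util.concurrent.Semaphore;

public class BlockingAcquireDemo {
    // Start 'count' stand-in threads that each release the semaphore once,
    // then block on acquire() once per expected release: no busy-wait, and
    // no resume() of threads that were never suspended.
    static int awaitAll(int count) throws InterruptedException {
        Semaphore sem = new Semaphore(0);
        for (int i = 0; i < count; i++) {
            new Thread(() -> sem.release()).start();
        }
        int waiting = count;
        while (waiting > 0) {
            sem.acquire(); // blocks until one stand-in thread has released
            --waiting;
        }
        return waiting;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitAll(3)); // prints 0 once all three have released
    }
}
```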
> All handshake tests passed 100 iterations on each platform.
> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go in first.
Okay. Just waiting for Dan's stress test results.
Thanks,
David
-----
> v5 full:
> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
> v5 inc:
> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>
> Thanks, Robbin
>
>>
>> Thanks,
>> David
>>
>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>> Check-out v3, just full:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>
>>>> src/hotspot/share/runtime/globals.hpp
>>>> No comments.
>>>>
>>>> src/hotspot/share/runtime/thread.hpp
>>>> No comments.
>>>>
>>>> src/hotspot/share/runtime/thread.cpp
>>>> L2954:     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>> L2955:            (has_last_Java_frame() && java_call_counter() > 0),
>>>>                   "wrong java_sp info!");
>>>> Perhaps change to this:
>>>>
>>>>     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>            (has_last_Java_frame() && java_call_counter() > 0),
>>>>            "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>>            has_last_Java_frame(), java_call_counter());
>>>>
>>>> and consider making the same change to the nmethod version.
>>>>
>>>
>>> Fixed both!
>>>
>>>> src/hotspot/share/runtime/vmThread.hpp
>>>> No comments.
>>>>
>>>> src/hotspot/share/runtime/vmThread.cpp
>>>> L449: // Must check for handshakes first, since ops returns.
>>>> Perhaps:
>>>> // Check for handshakes first since we may need to return a VMop.
>>>>
>>>> L454: // Check for a true cleanup first, trying to keep stats correct.
>>>> Perhaps:
>>>> // Check for a cleanup before SafepointALot to keep stats correct.
>>>>
>>>> L635: // We want to make sure that we get to a safepoint regularly.
>>>> Perhaps an addition:
>>>> // We want to make sure that we get to a safepoint regularly
>>>> // even when executing VMops that don't require safepoints.
>>>>
>>>
>>> Fixed above!
>>>
>>> Publishing v4, since I need a second review:
>>> Full:
>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>> Inc:
>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>
>>> Thanks, Robbin
>>>
>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>
>>>> If you decide to tweak the semantics for SafepointALot down the road,
>>>> it would be best to do that in its own bug rather than as part of
>>>> another bug.
>>>>
>>>> Thumbs up! Your call on whether to tweak the assert or change the comments.
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>> Hi Dan,
>>>>>>
>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>> And here is v2 for you to consider:
>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>
>>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>
>>>>>>>
>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>> No comments.
>>>>>>>
>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>> No comments.
>>>>>>>
>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>> L2953: // This checks that the thread have a correct frame state
>>>>>>> during a handshake
>>>>>>> typo: s/have/has/
>>>>>>> Also please add a '.' to end of the sentence.
>>>>>>
>>>>>> Fixed.
>>>>>>
>>>>>>>
>>>>>>> L2954:     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>> L2955:            (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>                   "wrong java_sp info!");
>>>>>>> Trying to make sure I understand what you are asserting:
>>>>>>>
>>>>>>> 1) if the thread does not have a last Java frame then it must have
>>>>>>> never called Java
>>>>>>> 2) if the thread has a last Java frame, then it must have called
>>>>>>> Java at least once
>>>>>>>
>>>>>>> I'm good with the second expression. I have some vague doubts about
>>>>>>> the first expression. When a JavaThread is done executing Java code and
>>>>>>> it is on its way toward a thread exit, is there ever a time when it
>>>>>>> no longer has a last Java frame? I'm thinking a handshake late in
>>>>>>> the JavaThread's life...
>>>>>>
>>>>>> The assert is a copy paste from
>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>
>>>>>> I have seen this assert trigger twice on Windows; either it's wrong and should
>>>>>> be removed from nmethods_do, or it is correct and this will help us find out
>>>>>> when this happens.
>>>>>
>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>> the assert in place. Perhaps tweak both like this:
>>>>>
>>>>>     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>            (has_last_Java_frame() && java_call_counter() > 0),
>>>>>            "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>>>            has_last_Java_frame(), java_call_counter());
>>>>>
>>>>>>
>>>>>>>
>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>> No comments.
>>>>>>>
>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>> L503:     MutexUnlockerEx mul(VMOperationQueue_lock, Mutex::_no_safepoint_check_flag);
>>>>>>> Please add something like this above L503:
>>>>>>>
>>>>>>> // Have to unlock VMOperationQueue_lock just in case
>>>>>>> // no_op_safepoint() has to do a handshake.
>>>>>>>
>>>>>>> old L619:     // We want to make sure that we get to a safepoint regularly.
>>>>>>> old L620:     //
>>>>>>> old L621:     if ((_cur_vm_operation = VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>> This call to no_op_safepoint() is at the bottom of the "while(true)"
>>>>>>> loop in VMThread::loop(). Before the fix for 8219436, this line
>>>>>>> used to be:
>>>>>>>
>>>>>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>
>>>>>>> and it was the only call to no_op_safepoint_needed() that passed
>>>>>>> 'true'. It's the 'true' parameter that made the comment on L619
>>>>>>> correct. Why do I say that? Well, the only way we get to the
>>>>>>> bottom of the "while(true)" loop is if we had a vm_operation to
>>>>>>> perform. Let's say we had a bunch of no-safepoint vm_operations
>>>>>>> to perform and we just kept executing them, one after another.
>>>>>>> While doing this work, if our time between safepoints exceeds
>>>>>>> GuaranteedSafepointInterval, then this no_op_safepoint_needed(true)
>>>>>>> call is what would detect that we've gone too long between
>>>>>>> safepoints
>>>>>>> and would force one.
>>>>>>>
>>>>>>> With the fix for 8219436, that call was changed to
>>>>>>> no_op_safepoint(false)
>>>>>>> and with the v2 version of this fix that call is now gone. So we no
>>>>>>> longer have the ability to have GuaranteedSafepointInterval work
>>>>>>> right when we are doing lots of non-safepoint VM operations.
>>>>>>>
>>>>>>> This no_op_safepoint_needed() probe point, either with 'true' or
>>>>>>> 'false', also gave us the opportunity to force a safepoint when
>>>>>>> SafepointALot is true or when
>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>> returns true. In the v2 version of this fix, we lose the ability
>>>>>>> to SafepointALot after each VM operation. We also lose the ability
>>>>>>> to do cleanup() after each VM operation.
>>>>>>
>>>>>> And I don't think that is an issue; ICache and Monitor, which is the
>>>>>> cleanup we do, have just basic heuristics. Whether we clean up every
>>>>>> second or every other second makes little difference.
>>>>>
>>>>> My point is that you are changing the semantics of SafepointALot without
>>>>> stating clearly that is what you are doing. The point of SafepointALot is
>>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>>> a very good reason.
>>>>>
>>>>>
>>>>>> SafepointALot is just a diagnostic option, which causes a safepoint directly
>>>>>> after a safepoint. I'd rather have it do one extra safepoint per
>>>>>> GuaranteedSafepointInterval; that way you know how many safepoints you have
>>>>>> per second, which is really useful for benchmarking. In its current form it
>>>>>> can't be used for benchmarks.
>>>>>
>>>>> You would never use SafepointALot in a benchmark run. It is a
>>>>> stress option.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> I think you need to restore the "bool check_time" parameter to
>>>>>>> no_op_safepoint() and you need to restore this call as
>>>>>>> VMThread::no_op_safepoint_needed(true), along with the rest
>>>>>>> of old L618-L626.
>>>>>>
>>>>>> I don't need the parameter; we can always check the time.
>>>>>> When we time out on the queue wait we will always pass the time check.
>>>>>
>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>>> a time check.
>>>>>
>>>>>
>>>>>> So the difference is that I think the previous behavior was bad/buggy and
>>>>>> you are trying to restore it :)
>>>>>>
>>>>>> Check-out v3, just full:
>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>
>>>>> I'll post a second message with my re-review for this webrev.
>>>>>
>>>>>
>>>>>>
>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>> - if timeout
>>>>>>     - do handshake a lot
>>>>>>     - we are guaranteed to pass the time check, so cleanup if needed
>>>>>>     - else safepoint alot
>>>>>> - if op
>>>>>>     - after op
>>>>>>         - do handshake a lot
>>>>>>         - check time, if expired, cleanup if needed
>>>>>>         - else safepoint alot
>>>>>>
>>>>>> I think that is what you want :)
>>>>>> And that's what's in v3.
>>>>>>
>>>>>> I would rather have:
>>>>>> - wait-time = GuaranteedSafepointInterval - SafepointTracing::time_since_last_safepoint_ms()
>>>>>> - if wait-time > 0
>>>>>>     - wait on queue
>>>>>> - else or timeout
>>>>>>     - do handshake a lot
>>>>>>     - cleanup if needed
>>>>>>     - else safepoint alot
>>>>>>
>>>>>> That way we do check for cleanup after non-safepoint ops in time, but we do
>>>>>> not cause extra safepoints after safepoints with SafepointALot. (I know you
>>>>>> think of this as a feature :) )
>>>>>
>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>
>>>>> Dan
>>>>>
>>>>>>
>>>>>> Thanks, Robbin
>>>>>
>>>>