RFR(s): 8220774: Add HandshakeALot diag option

Mon Mar 25 22:22:48 UTC 2019

On 25/03/2019 11:06 pm, Robbin Ehn wrote:
> Hi David,
> 
> On 3/25/19 1:04 PM, David Holmes wrote:
>> Hi Robbin,
>>
>> Regarding the test and circular suspends ... if the _suspend_threads 
>> can end up causing a hang due to circular suspends then having the 
>> main thread issue an extra resume may not help because the 
>> _suspend_threads could re-enter a circular suspend state after those 
>> extra resumes.
> 
> That's why we issue as many as needed until semaphore count is correct.

Yeah my bad - sorry. Replied from memory instead of re-reading the code.

David

> /Robbin
> 
> 
>>
>> David
>>
>> On 25/03/2019 7:09 pm, Robbin Ehn wrote:
>>> Hi David,
>>>
>>> On 3/25/19 2:36 AM, David Holmes wrote:
>>>> Hi Robbin,
>>>>
>>>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>>>> Hi David,
>>>>>
>>>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>>>> Hi Robbin,
>>>>>>
>>>>>> This was a little more complex than I had imagined. :) A couple of 
>>>>>> comments:
>>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>
>>>>>>    508           // Have to unlock VMOperationQueue_lock just in case
>>>>>> no_op_safepoint()
>>>>>>    509           // has to do a handshake.
>>>>>>    510           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>    511           if (timedout && (_cur_vm_operation =
>>>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>>>
>>>>>> wouldn't it be better to check timedout first and only then use 
>>>>>> the unlocker
>>>>>> then check _cur_vm_operation?
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>
>>>>>> Can has_last_Java_frame() and java_call_counter() change values 
>>>>>> between their
>>>>>> use in the assert and their use in the assert message?
>>>>>
>>>>> No, the JavaThread is safepoint_safe here, thus must have a stable 
>>>>> java stack.
>>>>
>>>> Ok.
>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Don't you want to add some tests that exercise this? Or update 
>>>>>> existing tests
>>>>>> that use SafepointALot to also use HandshakeALot?
>>>>>
>>>>> Fixed!
>>>>
>>>> I'm unclear how this fix and the renamed test relate to the existing 
>>>> bug JDK-8214174 that was referenced in the ProblemList entry ??
>>>
>>> Since we are not using withebox to walk thread stack, instead using
>>> HandshakeALot I removed Walk from name. The original could cause a 
>>> circular
>>> suspend issue.
>>>
>>>>
>>>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>>>
>>>> If you need the loop index use the full form of the for-loop. This:
>>>>
>>>>    43             int i = 0;
>>>>    44             for (Thread thr : _suspend_threads) {
>>>>    45                 if (i++ > _suspend_threads.length -2) {
>>>>    46                     // Leave last 2 threads running.
>>>>    47                     break;
>>>>    48                 }
>>>>    49                 if (Thread.currentThread() != thr) {
>>>>
>>>> can reduce to simply:
>>>>
>>>>      // Leave last 2 threads running.
>>>>      for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>>>          if (Thread.currentThread() != _suspend_threads[i]) {
>>>>
>>>
>>> Fixed.
>>>
>>>>
>>>>   67         // Wait for all suspend thread starting to loop.
>>>>
>>>> ->  // Wait for all suspend-threads to start looping.
>>>>
>>>>   75             exit_threads[i] = new Thread(new Runnable() { 
>>>> public void run() {} });
>>>>
>>>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>>>
>>>>
>>>>    79         // Try to suspend them.
>>>>    80         for (Thread thr : exit_threads) {
>>>>    81             thr.suspend();
>>>>    82         }
>>>>    83         for (Thread thr : exit_threads) {
>>>>    84             thr.resume();
>>>>    85         }
>>>>
>>>> there's really no guarantee of getting the suspend during "exit". 
>>>> The SuspendAtExit test seems to use logging to make it more likely 
>>>> to encounter the suspend when desired. Is there a reason not to add 
>>>> a second @run to that test to use HandShakes?
>>>
>>> The original issue I had was so unlikely that I needed tenths of 
>>> thousands of
>>> thread and looping that for several hours. (that issue is fixed)
>>> So the test just try to quickly with a small chance hit interesting 
>>> paths.
>>>
>>> We can do that also.
>>>
>>>>
>>>>    90         do {
>>>>    91             for (Thread thr : _suspend_threads) {
>>>>    92                 thr.resume();
>>>>    93             }
>>>>    94             while (_sem.tryAcquire()) {
>>>>    95                 --waiting;
>>>>    96             }
>>>>    97         } while (waiting > 0);
>>>>
>>>> why do a busy-wait and resume threads you never suspended? Why not 
>>>> just do a blocking acquire() on the semaphore? The main 
>>>> suspend/resume logic in the _suspend_threads must leave the target 
>>>> thread resumed.
>>>
>>> The _suspend_threads may create a circular suspend, so we resume them 
>>> just in case.
>>>
>>>>
>>>>> All handshakes test passed 100 iteration on each platform.
>>>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go 
>>>>> in first.
>>>>
>>>> Okay. Just waiting for Dan's stress test results.
>>>
>>> Please see v6:
>>> Inc:  http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>>> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>>>
>>> Thanks, Robbin
>>>
>>>
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>
>>>>> v5 full:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>>>> v5 inc:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>>>> Check-out v3, just full:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>       L2955:          (has_last_Java_frame() && 
>>>>>>>> java_call_counter() > 0),
>>>>>>>> "wrong java_sp info!");
>>>>>>>>          Perhaps change to this:
>>>>>>>>
>>>>>>>>                assert((!has_last_Java_frame() && 
>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>                       (has_last_Java_frame() && 
>>>>>>>> java_call_counter() > 0),
>>>>>>>>                       "unexpected frame info: has_last_frame=%d,
>>>>>>>> java_call_counter=%d",
>>>>>>>>                       has_last_Java_frame(), java_call_counter());
>>>>>>>>
>>>>>>>>           and consider making the same change to the nmethod 
>>>>>>>> version.
>>>>>>>>
>>>>>>>
>>>>>>> Fixed both!
>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>       No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>       L449:   // Must check for handshakes first, since ops 
>>>>>>>> returns.
>>>>>>>>           Perhaps:
>>>>>>>>               // Check for handshakes first since we may need to 
>>>>>>>> return a VMop.
>>>>>>>>
>>>>>>>>       L454:   // Check for a true cleanup first, trying to keep 
>>>>>>>> stats correct.
>>>>>>>>           Perhaps:
>>>>>>>>               // Check for a cleanup before SafepointALot to 
>>>>>>>> keep stats correct.
>>>>>>>>
>>>>>>>>       L635:     // We want to make sure that we get to a 
>>>>>>>> safepoint regularly.
>>>>>>>>           Perhaps an addition:
>>>>>>>>                 // We want to make sure that we get to a 
>>>>>>>> safepoint regularly
>>>>>>>>                 // even when executing VMops that don't require 
>>>>>>>> safepoints.
>>>>>>>>
>>>>>>>
>>>>>>> Fixed above!
>>>>>>>
>>>>>>> Publishing v4, since I need a second review:
>>>>>>> Full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>>>> Inc:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>>
>>>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>>>
>>>>>>>> If you decide to tweak the semantics for SafepointALot down the 
>>>>>>>> road,
>>>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>>>> another bug.
>>>>>>>>
>>>>>>>> Thumbs up! Your call on whether to tweak the assert or change 
>>>>>>>> the comments.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>>>
>>>>>>>>>>> It's too difficult to craft my comments relative to the 
>>>>>>>>>>> incremental
>>>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>>>       No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>>>       No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>>>       L2953:   // This checks that the thread have a correct 
>>>>>>>>>>> frame state
>>>>>>>>>>> during a handshake
>>>>>>>>>>>           typo: s/have/has/
>>>>>>>>>>>           Also please add a '.' to end of the sentence.
>>>>>>>>>>
>>>>>>>>>> Fixed.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>>>       L2955:          (has_last_Java_frame() && 
>>>>>>>>>>> java_call_counter() > 0),
>>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>>>           Trying to make sure I understand what you are 
>>>>>>>>>>> asserting:
>>>>>>>>>>>
>>>>>>>>>>>           1) if the thread does not have a last Java frame 
>>>>>>>>>>> then it must have
>>>>>>>>>>>              never called Java
>>>>>>>>>>>           2) if the thread has a last Java frame, then it 
>>>>>>>>>>> must have called
>>>>>>>>>>>              Java at least once
>>>>>>>>>>>
>>>>>>>>>>>           I'm good with the second expression. I have some 
>>>>>>>>>>> vague doubts about
>>>>>>>>>>>           first expression. When a JavaThread is done 
>>>>>>>>>>> executing Java code and
>>>>>>>>>>>           it is on its way toward a thread exit, is there 
>>>>>>>>>>> ever a time when it
>>>>>>>>>>>           no longer has a last Java frame? I'm thinking a 
>>>>>>>>>>> handshake late in
>>>>>>>>>>>           the JavaThread's life...
>>>>>>>>>>
>>>>>>>>>> The assert is a copy paste from
>>>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>>>
>>>>>>>>>> I have seen this assert trigger twice on windows, either it's 
>>>>>>>>>> wrong and should
>>>>>>>>>> be removed from nmethods_do, or it is correct and this will 
>>>>>>>>>> help us find when
>>>>>>>>>> this happens.
>>>>>>>>>
>>>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>>>
>>>>>>>>>      assert((!has_last_Java_frame() && java_call_counter() == 
>>>>>>>>> 0) ||
>>>>>>>>>             (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>>             "unexpected frame info: has_last_frame=%d, 
>>>>>>>>> java_call_counter=%d",
>>>>>>>>>             has_last_Java_frame(), java_call_counter());
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>>>       No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>>>       L503:           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>>>           Please add something like this above L503:
>>>>>>>>>>>
>>>>>>>>>>>                       // Have to unlock VMOperationQueue_lock 
>>>>>>>>>>> just in case
>>>>>>>>>>>                       // no_op_safepoint() has to do a 
>>>>>>>>>>> handshake.
>>>>>>>>>>>
>>>>>>>>>>>       old L619:     // We want to make sure that we get to a 
>>>>>>>>>>> safepoint
>>>>>>>>>>> regularly.
>>>>>>>>>>>       old L620:     //
>>>>>>>>>>>       old L621:     if ((_cur_vm_operation =
>>>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>>>           This call to no_op_safepoint() is at the bottom of 
>>>>>>>>>>> the "while(true)"
>>>>>>>>>>>           loop in VMThread::loop(). Before the fix for 
>>>>>>>>>>> 8219436, this line
>>>>>>>>>>> used to be:
>>>>>>>>>>>
>>>>>>>>>>>               if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>>>
>>>>>>>>>>>           and it was the only call to 
>>>>>>>>>>> no_op_safepoint_needed() that passed
>>>>>>>>>>>           'true'. It's the 'true' parameter that made the 
>>>>>>>>>>> comment on L619
>>>>>>>>>>>           correct. Why do I say that? Well, the only way we 
>>>>>>>>>>> get to the
>>>>>>>>>>>           bottom of the "while(true)" loop is if we had a 
>>>>>>>>>>> vm_operation to
>>>>>>>>>>>           perform. Let's say we had a bunch of no-safepoint 
>>>>>>>>>>> vm_operations
>>>>>>>>>>>           to perform and we just kept executing them, one 
>>>>>>>>>>> after another.
>>>>>>>>>>>           While doing this work, if our time between 
>>>>>>>>>>> safepoints exceeds
>>>>>>>>>>>           GuaranteedSafepointInterval, then this 
>>>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>>>           call is what would detect that we've gone too long 
>>>>>>>>>>> between
>>>>>>>>>>> safepoints
>>>>>>>>>>>           and would force one.
>>>>>>>>>>>
>>>>>>>>>>>           With the fix for 8219436, that call was changed to
>>>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>>>           and with the v2 version of this fix that call is 
>>>>>>>>>>> now gone. So we no
>>>>>>>>>>>           longer have the ability to have 
>>>>>>>>>>> GuaranteedSafepointInterval work
>>>>>>>>>>>           right when we are doing lots of non-safepoint VM 
>>>>>>>>>>> operations.
>>>>>>>>>>>
>>>>>>>>>>>           This no_op_safepoint_needed() probe point, either 
>>>>>>>>>>> with 'true' or
>>>>>>>>>>>           'false', also gave us the opportunity to force a 
>>>>>>>>>>> safepoint when
>>>>>>>>>>>           SafepointALot is true or when
>>>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>>>           returns true. In the v2 version of this fix, we 
>>>>>>>>>>> lose the ability
>>>>>>>>>>>           to SafepointALot after each VM operation. We also 
>>>>>>>>>>> lose the ability
>>>>>>>>>>>           to do cleanup() after each VM operation.
>>>>>>>>>>
>>>>>>>>>> And I don't think that is an issue, ICache and Monitor, which 
>>>>>>>>>> is the
>>>>>>>>>> cleanup we
>>>>>>>>>> do, just have just basic heuristic. If we cleanup every second 
>>>>>>>>>> or every other
>>>>>>>>>> second makes little difference.
>>>>>>>>>
>>>>>>>>> My point is that you are changing the semantics of 
>>>>>>>>> SafepointALot without
>>>>>>>>> stating clearly that is what you are doing. The point of 
>>>>>>>>> SafepointALot is
>>>>>>>>> to inject a safepoint after every VMop even those VMops that 
>>>>>>>>> have to be
>>>>>>>>> executed at a safepoint. SafepointALot is considered a stress 
>>>>>>>>> option for
>>>>>>>>> a very good reason.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> SafepointALot is just diag, which causes safepoint directly 
>>>>>>>>>> after a safepoint.
>>>>>>>>>> I'd rather have it doing one extra safepoint per 
>>>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>>>> that way you know how many safepoint you have per second, 
>>>>>>>>>> which is really
>>>>>>>>>> useful
>>>>>>>>>> for benchmarking. In current form it can't be used for 
>>>>>>>>>> benchmarks.
>>>>>>>>>
>>>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>>>> stress option.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>           I think you need to restore the "bool check_time" 
>>>>>>>>>>> parameter to
>>>>>>>>>>>           no_op_safepoint() and you need to restore this call 
>>>>>>>>>>> to as:
>>>>>>>>>>>           VMThread::no_op_safepoint_needed(true) along with 
>>>>>>>>>>> the rest
>>>>>>>>>>>           of old L618-L626.
>>>>>>>>>>
>>>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>>>>
>>>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you 
>>>>>>>>> can miss
>>>>>>>>> a time check.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> So the difference is that I think the previous behavior was 
>>>>>>>>>> bad/buggy and
>>>>>>>>>> you are trying to restore it :)
>>>>>>>>>>
>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>
>>>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>>>> - if timeout
>>>>>>>>>>      - do handshake a lot
>>>>>>>>>>      - we are guaranteed to pass the time check, so cleanup if 
>>>>>>>>>> needed
>>>>>>>>>>      - else safepoint alot
>>>>>>>>>> - if op
>>>>>>>>>>      - after op
>>>>>>>>>>          - do handshake a lot
>>>>>>>>>>          - check time, if expired, cleanup if needed
>>>>>>>>>>          - else safepoint alot
>>>>>>>>>>
>>>>>>>>>> I think that is what you want :)
>>>>>>>>>> And that's what in v3.
>>>>>>>>>>
>>>>>>>>>> I would rather have:
>>>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>>>> - if wait-time > 0
>>>>>>>>>>      - wait on queue
>>>>>>>>>> - else or timeout
>>>>>>>>>>      - do handshake a lot
>>>>>>>>>>      - cleanup if needed
>>>>>>>>>>      - else safepoint alot
>>>>>>>>>>
>>>>>>>>>> That way we do check for cleanup after non-safepoint in time 
>>>>>>>>>> but we do not
>>>>>>>>>> causes extra safepoints after safepoints with SafepointALot. 
>>>>>>>>>> (I know you
>>>>>>>>>> think of this as feature :) )
>>>>>>>>>
>>>>>>>>> Not so much a feature as a stress option. I love my stress 
>>>>>>>>> options.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks, Robbin
>>>>>>>>>
>>>>>>>>