RFR(s): 8220774: Add HandshakeALot diag option

Robbin Ehn robbin.ehn at oracle.com
Tue Mar 26 07:17:21 UTC 2019


> Yep all good!

Great, thanks!

/Robbin

> 
> Thanks,
> David
> 
>> Thanks, Robbin
>>
>>
>>>
>>> Cheers,
>>> David
>>>
>>>> Thumbs up!
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>> On 3/25/19 5:09 AM, Robbin Ehn wrote:
>>>>> Hi David,
>>>>>
>>>>> On 3/25/19 2:36 AM, David Holmes wrote:
>>>>>> Hi Robbin,
>>>>>>
>>>>>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>>>>>> Hi David,
>>>>>>>
>>>>>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>>>>>> Hi Robbin,
>>>>>>>>
>>>>>>>> This was a little more complex than I had imagined. :) A couple of 
>>>>>>>> comments:
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>
>>>>>>>>    508           // Have to unlock VMOperationQueue_lock just in case
>>>>>>>> no_op_safepoint()
>>>>>>>>    509           // has to do a handshake.
>>>>>>>>    510           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>    511           if (timedout && (_cur_vm_operation =
>>>>>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>>>>>
>>>>>>>> wouldn't it be better to check timedout first and only then use the 
>>>>>>>> unlocker
>>>>>>>> then check _cur_vm_operation?
>>>>>>>
>>>>>>> Fixed.
>>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>
>>>>>>>> Can has_last_Java_frame() and java_call_counter() change values between 
>>>>>>>> their
>>>>>>>> use in the assert and their use in the assert message?
>>>>>>>
>>>>>>> No, the JavaThread is safepoint_safe here, thus must have a stable java 
>>>>>>> stack.
>>>>>>
>>>>>> Ok.
>>>>>>
>>>>>>>>
>>>>>>>> ---
>>>>>>>>
>>>>>>>> Don't you want to add some tests that exercise this? Or update existing 
>>>>>>>> tests
>>>>>>>> that use SafepointALot to also use HandshakeALot?
>>>>>>>
>>>>>>> Fixed!
>>>>>>
>>>>>> I'm unclear how this fix and the renamed test relate to the existing bug 
>>>>>> JDK-8214174 that was referenced in the ProblemList entry ??
>>>>>
>>>>> Since we are not using withebox to walk thread stack, instead using
>>>>> HandshakeALot I removed Walk from name. The original could cause a circular
>>>>> suspend issue.
>>>>>
>>>>>>
>>>>>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>>>>>
>>>>>> If you need the loop index use the full form of the for-loop. This:
>>>>>>
>>>>>>    43             int i = 0;
>>>>>>    44             for (Thread thr : _suspend_threads) {
>>>>>>    45                 if (i++ > _suspend_threads.length -2) {
>>>>>>    46                     // Leave last 2 threads running.
>>>>>>    47                     break;
>>>>>>    48                 }
>>>>>>    49                 if (Thread.currentThread() != thr) {
>>>>>>
>>>>>> can reduce to simply:
>>>>>>
>>>>>>      // Leave last 2 threads running.
>>>>>>      for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>>>>>          if (Thread.currentThread() != _suspend_threads[i]) {
>>>>>>
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>>   67         // Wait for all suspend thread starting to loop.
>>>>>>
>>>>>> ->  // Wait for all suspend-threads to start looping.
>>>>>>
>>>>>>   75             exit_threads[i] = new Thread(new Runnable() { public void 
>>>>>> run() {} });
>>>>>>
>>>>>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>>>>>
>>>>>>
>>>>>>    79         // Try to suspend them.
>>>>>>    80         for (Thread thr : exit_threads) {
>>>>>>    81             thr.suspend();
>>>>>>    82         }
>>>>>>    83         for (Thread thr : exit_threads) {
>>>>>>    84             thr.resume();
>>>>>>    85         }
>>>>>>
>>>>>> there's really no guarantee of getting the suspend during "exit". The 
>>>>>> SuspendAtExit test seems to use logging to make it more likely to 
>>>>>> encounter the suspend when desired. Is there a reason not to add a second 
>>>>>> @run to that test to use HandShakes?
>>>>>
>>>>> The original issue I had was so unlikely that I needed tenths of thousands of
>>>>> thread and looping that for several hours. (that issue is fixed)
>>>>> So the test just try to quickly with a small chance hit interesting paths.
>>>>>
>>>>> We can do that also.
>>>>>
>>>>>>
>>>>>>    90         do {
>>>>>>    91             for (Thread thr : _suspend_threads) {
>>>>>>    92                 thr.resume();
>>>>>>    93             }
>>>>>>    94             while (_sem.tryAcquire()) {
>>>>>>    95                 --waiting;
>>>>>>    96             }
>>>>>>    97         } while (waiting > 0);
>>>>>>
>>>>>> why do a busy-wait and resume threads you never suspended? Why not just do 
>>>>>> a blocking acquire() on the semaphore? The main suspend/resume logic in 
>>>>>> the _suspend_threads must leave the target thread resumed.
>>>>>
>>>>> The _suspend_threads may create a circular suspend, so we resume them just 
>>>>> in case.
>>>>>
>>>>>>
>>>>>>> All handshakes test passed 100 iteration on each platform.
>>>>>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go in 
>>>>>>> first.
>>>>>>
>>>>>> Okay. Just waiting for Dan's stress test results.
>>>>>
>>>>> Please see v6:
>>>>> Inc:  http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>>>>> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>> -----
>>>>>>
>>>>>>
>>>>>>> v5 full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>>>>>> v5 inc:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>>
>>>>>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>>       No comments.
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>>       No comments.
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>>       L2954:   assert((!has_last_Java_frame() && java_call_counter() 
>>>>>>>>>> == 0) ||
>>>>>>>>>>       L2955:          (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>>          Perhaps change to this:
>>>>>>>>>>
>>>>>>>>>>                assert((!has_last_Java_frame() && java_call_counter() 
>>>>>>>>>> == 0) ||
>>>>>>>>>>                       (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>>>                       "unexpected frame info: has_last_frame=%d,
>>>>>>>>>> java_call_counter=%d",
>>>>>>>>>>                       has_last_Java_frame(), java_call_counter());
>>>>>>>>>>
>>>>>>>>>>           and consider making the same change to the nmethod version.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Fixed both!
>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>>       No comments.
>>>>>>>>>>
>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>>       L449:   // Must check for handshakes first, since ops returns.
>>>>>>>>>>           Perhaps:
>>>>>>>>>>               // Check for handshakes first since we may need to 
>>>>>>>>>> return a VMop.
>>>>>>>>>>
>>>>>>>>>>       L454:   // Check for a true cleanup first, trying to keep stats 
>>>>>>>>>> correct.
>>>>>>>>>>           Perhaps:
>>>>>>>>>>               // Check for a cleanup before SafepointALot to keep 
>>>>>>>>>> stats correct.
>>>>>>>>>>
>>>>>>>>>>       L635:     // We want to make sure that we get to a safepoint 
>>>>>>>>>> regularly.
>>>>>>>>>>           Perhaps an addition:
>>>>>>>>>>                 // We want to make sure that we get to a safepoint 
>>>>>>>>>> regularly
>>>>>>>>>>                 // even when executing VMops that don't require 
>>>>>>>>>> safepoints.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Fixed above!
>>>>>>>>>
>>>>>>>>> Publishing v4, since I need a second review:
>>>>>>>>> Full:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>>>>>> Inc:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>>>>>
>>>>>>>>> Thanks, Robbin
>>>>>>>>>
>>>>>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>>>>>
>>>>>>>>>> If you decide to tweak the semantics for SafepointALot down the road,
>>>>>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>>>>>> another bug.
>>>>>>>>>>
>>>>>>>>>> Thumbs up! Your call on whether to tweak the assert or change the 
>>>>>>>>>> comments.
>>>>>>>>>>
>>>>>>>>>> Dan
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>>>>>> Hi Dan,
>>>>>>>>>>>>
>>>>>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>>>>>
>>>>>>>>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>>>>>       No comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>>>>>       No comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>>>>>       L2953:   // This checks that the thread have a correct frame 
>>>>>>>>>>>>> state
>>>>>>>>>>>>> during a handshake
>>>>>>>>>>>>>           typo: s/have/has/
>>>>>>>>>>>>>           Also please add a '.' to end of the sentence.
>>>>>>>>>>>>
>>>>>>>>>>>> Fixed.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>       L2954:   assert((!has_last_Java_frame() && 
>>>>>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>>>>>       L2955:          (has_last_Java_frame() && java_call_counter() 
>>>>>>>>>>>>> > 0),
>>>>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>>>>>           Trying to make sure I understand what you are asserting:
>>>>>>>>>>>>>
>>>>>>>>>>>>>           1) if the thread does not have a last Java frame then it 
>>>>>>>>>>>>> must have
>>>>>>>>>>>>>              never called Java
>>>>>>>>>>>>>           2) if the thread has a last Java frame, then it must have 
>>>>>>>>>>>>> called
>>>>>>>>>>>>>              Java at least once
>>>>>>>>>>>>>
>>>>>>>>>>>>>           I'm good with the second expression. I have some vague 
>>>>>>>>>>>>> doubts about
>>>>>>>>>>>>>           first expression. When a JavaThread is done executing 
>>>>>>>>>>>>> Java code and
>>>>>>>>>>>>>           it is on its way toward a thread exit, is there ever a 
>>>>>>>>>>>>> time when it
>>>>>>>>>>>>>           no longer has a last Java frame? I'm thinking a handshake 
>>>>>>>>>>>>> late in
>>>>>>>>>>>>>           the JavaThread's life...
>>>>>>>>>>>>
>>>>>>>>>>>> The assert is a copy paste from
>>>>>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>>>>>
>>>>>>>>>>>> I have seen this assert trigger twice on windows, either it's wrong 
>>>>>>>>>>>> and should
>>>>>>>>>>>> be removed from nmethods_do, or it is correct and this will help us 
>>>>>>>>>>>> find when
>>>>>>>>>>>> this happens.
>>>>>>>>>>>
>>>>>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>>>>>
>>>>>>>>>>>      assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>>>>>>             (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>>>>             "unexpected frame info: has_last_frame=%d, 
>>>>>>>>>>> java_call_counter=%d",
>>>>>>>>>>>             has_last_Java_frame(), java_call_counter());
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>>>>>       No comments.
>>>>>>>>>>>>>
>>>>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>>>>>       L503:           MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>>>>>           Please add something like this above L503:
>>>>>>>>>>>>>
>>>>>>>>>>>>>                       // Have to unlock VMOperationQueue_lock just 
>>>>>>>>>>>>> in case
>>>>>>>>>>>>>                       // no_op_safepoint() has to do a handshake.
>>>>>>>>>>>>>
>>>>>>>>>>>>>       old L619:     // We want to make sure that we get to a safepoint
>>>>>>>>>>>>> regularly.
>>>>>>>>>>>>>       old L620:     //
>>>>>>>>>>>>>       old L621:     if ((_cur_vm_operation =
>>>>>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>>>>>           This call to no_op_safepoint() is at the bottom of the 
>>>>>>>>>>>>> "while(true)"
>>>>>>>>>>>>>           loop in VMThread::loop(). Before the fix for 8219436, 
>>>>>>>>>>>>> this line
>>>>>>>>>>>>> used to be:
>>>>>>>>>>>>>
>>>>>>>>>>>>>               if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>>>>>
>>>>>>>>>>>>>           and it was the only call to no_op_safepoint_needed() that 
>>>>>>>>>>>>> passed
>>>>>>>>>>>>>           'true'. It's the 'true' parameter that made the comment 
>>>>>>>>>>>>> on L619
>>>>>>>>>>>>>           correct. Why do I say that? Well, the only way we get to the
>>>>>>>>>>>>>           bottom of the "while(true)" loop is if we had a 
>>>>>>>>>>>>> vm_operation to
>>>>>>>>>>>>>           perform. Let's say we had a bunch of no-safepoint 
>>>>>>>>>>>>> vm_operations
>>>>>>>>>>>>>           to perform and we just kept executing them, one after 
>>>>>>>>>>>>> another.
>>>>>>>>>>>>>           While doing this work, if our time between safepoints 
>>>>>>>>>>>>> exceeds
>>>>>>>>>>>>>           GuaranteedSafepointInterval, then this 
>>>>>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>>>>>           call is what would detect that we've gone too long between
>>>>>>>>>>>>> safepoints
>>>>>>>>>>>>>           and would force one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>           With the fix for 8219436, that call was changed to
>>>>>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>>>>>           and with the v2 version of this fix that call is now 
>>>>>>>>>>>>> gone. So we no
>>>>>>>>>>>>>           longer have the ability to have 
>>>>>>>>>>>>> GuaranteedSafepointInterval work
>>>>>>>>>>>>>           right when we are doing lots of non-safepoint VM operations.
>>>>>>>>>>>>>
>>>>>>>>>>>>>           This no_op_safepoint_needed() probe point, either with 
>>>>>>>>>>>>> 'true' or
>>>>>>>>>>>>>           'false', also gave us the opportunity to force a 
>>>>>>>>>>>>> safepoint when
>>>>>>>>>>>>>           SafepointALot is true or when
>>>>>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>>>>>           returns true. In the v2 version of this fix, we lose the 
>>>>>>>>>>>>> ability
>>>>>>>>>>>>>           to SafepointALot after each VM operation. We also lose 
>>>>>>>>>>>>> the ability
>>>>>>>>>>>>>           to do cleanup() after each VM operation.
>>>>>>>>>>>>
>>>>>>>>>>>> And I don't think that is an issue, ICache and Monitor, which is the
>>>>>>>>>>>> cleanup we
>>>>>>>>>>>> do, just have just basic heuristic. If we cleanup every second or 
>>>>>>>>>>>> every other
>>>>>>>>>>>> second makes little difference.
>>>>>>>>>>>
>>>>>>>>>>> My point is that you are changing the semantics of SafepointALot without
>>>>>>>>>>> stating clearly that is what you are doing. The point of 
>>>>>>>>>>> SafepointALot is
>>>>>>>>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>>>>>>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>>>>>>>>> a very good reason.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> SafepointALot is just diag, which causes safepoint directly after a 
>>>>>>>>>>>> safepoint.
>>>>>>>>>>>> I'd rather have it doing one extra safepoint per 
>>>>>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>>>>>> that way you know how many safepoint you have per second, which is 
>>>>>>>>>>>> really
>>>>>>>>>>>> useful
>>>>>>>>>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>>>>>>>>>
>>>>>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>>>>>> stress option.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>           I think you need to restore the "bool check_time" 
>>>>>>>>>>>>> parameter to
>>>>>>>>>>>>>           no_op_safepoint() and you need to restore this call to as:
>>>>>>>>>>>>>           VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>>>>>>>>>           of old L618-L626.
>>>>>>>>>>>>
>>>>>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>>>>>>
>>>>>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>>>>>>>>> a time check.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> So the difference is that I think the previous behavior was 
>>>>>>>>>>>> bad/buggy and
>>>>>>>>>>>> you are trying to restore it :)
>>>>>>>>>>>>
>>>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>>>
>>>>>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>>>>>> - if timeout
>>>>>>>>>>>>      - do handshake a lot
>>>>>>>>>>>>      - we are guaranteed to pass the time check, so cleanup if needed
>>>>>>>>>>>>      - else safepoint alot
>>>>>>>>>>>> - if op
>>>>>>>>>>>>      - after op
>>>>>>>>>>>>          - do handshake a lot
>>>>>>>>>>>>          - check time, if expired, cleanup if needed
>>>>>>>>>>>>          - else safepoint alot
>>>>>>>>>>>>
>>>>>>>>>>>> I think that is what you want :)
>>>>>>>>>>>> And that's what in v3.
>>>>>>>>>>>>
>>>>>>>>>>>> I would rather have:
>>>>>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>>>>>> - if wait-time > 0
>>>>>>>>>>>>      - wait on queue
>>>>>>>>>>>> - else or timeout
>>>>>>>>>>>>      - do handshake a lot
>>>>>>>>>>>>      - cleanup if needed
>>>>>>>>>>>>      - else safepoint alot
>>>>>>>>>>>>
>>>>>>>>>>>> That way we do check for cleanup after non-safepoint in time but we 
>>>>>>>>>>>> do not
>>>>>>>>>>>> causes extra safepoints after safepoints with SafepointALot. (I know 
>>>>>>>>>>>> you
>>>>>>>>>>>> think of this as feature :) )
>>>>>>>>>>>
>>>>>>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>>>>>>
>>>>>>>>>>> Dan
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, Robbin
>>>>>>>>>>>
>>>>>>>>>>
>>>>


More information about the hotspot-runtime-dev mailing list