RFR(s): 8220774: Add HandshakeALot diag option
David Holmes
david.holmes at oracle.com
Mon Mar 25 22:22:48 UTC 2019
On 25/03/2019 11:06 pm, Robbin Ehn wrote:
> Hi David,
>
> On 3/25/19 1:04 PM, David Holmes wrote:
>> Hi Robbin,
>>
>> Regarding the test and circular suspends ... if the _suspend_threads
>> can end up causing a hang due to circular suspends then having the
>> main thread issue an extra resume may not help because the
>> _suspend_threads could re-enter a circular suspend state after those
>> extra resumes.
>
> That's why we issue as many as needed until semaphore count is correct.
Yeah my bad - sorry. Replied from memory instead of re-reading the code.
David
> /Robbin
>
>
>>
>> David
>>
>> On 25/03/2019 7:09 pm, Robbin Ehn wrote:
>>> Hi David,
>>>
>>> On 3/25/19 2:36 AM, David Holmes wrote:
>>>> Hi Robbin,
>>>>
>>>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>>>> Hi David,
>>>>>
>>>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>>>> Hi Robbin,
>>>>>>
>>>>>> This was a little more complex than I had imagined. :) A couple of
>>>>>> comments:
>>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>
>>>>>> 508 // Have to unlock VMOperationQueue_lock just in case
>>>>>> no_op_safepoint()
>>>>>> 509 // has to do a handshake.
>>>>>> 510 MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>> 511 if (timedout && (_cur_vm_operation =
>>>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>>>
>>>>>> wouldn't it be better to check timedout first and only then use
>>>>>> the unlocker
>>>>>> then check _cur_vm_operation?
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>
>>>>>> Can has_last_Java_frame() and java_call_counter() change values
>>>>>> between their
>>>>>> use in the assert and their use in the assert message?
>>>>>
>>>>> No, the JavaThread is safepoint_safe here, thus must have a stable
>>>>> java stack.
>>>>
>>>> Ok.
>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Don't you want to add some tests that exercise this? Or update
>>>>>> existing tests
>>>>>> that use SafepointALot to also use HandshakeALot?
>>>>>
>>>>> Fixed!
>>>>
>>>> I'm unclear how this fix and the renamed test relate to the existing
>>>> bug JDK-8214174 that was referenced in the ProblemList entry ??
>>>
>>> Since we are not using withebox to walk thread stack, instead using
>>> HandshakeALot I removed Walk from name. The original could cause a
>>> circular
>>> suspend issue.
>>>
>>>>
>>>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>>>
>>>> If you need the loop index use the full form of the for-loop. This:
>>>>
>>>> 43 int i = 0;
>>>> 44 for (Thread thr : _suspend_threads) {
>>>> 45 if (i++ > _suspend_threads.length -2) {
>>>> 46 // Leave last 2 threads running.
>>>> 47 break;
>>>> 48 }
>>>> 49 if (Thread.currentThread() != thr) {
>>>>
>>>> can reduce to simply:
>>>>
>>>> // Leave last 2 threads running.
>>>> for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>>> if (Thread.currentThread() != _suspend_threads[i]) {
>>>>
>>>
>>> Fixed.
>>>
>>>>
>>>> 67 // Wait for all suspend thread starting to loop.
>>>>
>>>> -> // Wait for all suspend-threads to start looping.
>>>>
>>>> 75 exit_threads[i] = new Thread(new Runnable() {
>>>> public void run() {} });
>>>>
>>>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>>>
>>>>
>>>> 79 // Try to suspend them.
>>>> 80 for (Thread thr : exit_threads) {
>>>> 81 thr.suspend();
>>>> 82 }
>>>> 83 for (Thread thr : exit_threads) {
>>>> 84 thr.resume();
>>>> 85 }
>>>>
>>>> there's really no guarantee of getting the suspend during "exit".
>>>> The SuspendAtExit test seems to use logging to make it more likely
>>>> to encounter the suspend when desired. Is there a reason not to add
>>>> a second @run to that test to use HandShakes?
>>>
>>> The original issue I had was so unlikely that I needed tenths of
>>> thousands of
>>> thread and looping that for several hours. (that issue is fixed)
>>> So the test just try to quickly with a small chance hit interesting
>>> paths.
>>>
>>> We can do that also.
>>>
>>>>
>>>> 90 do {
>>>> 91 for (Thread thr : _suspend_threads) {
>>>> 92 thr.resume();
>>>> 93 }
>>>> 94 while (_sem.tryAcquire()) {
>>>> 95 --waiting;
>>>> 96 }
>>>> 97 } while (waiting > 0);
>>>>
>>>> why do a busy-wait and resume threads you never suspended? Why not
>>>> just do a blocking acquire() on the semaphore? The main
>>>> suspend/resume logic in the _suspend_threads must leave the target
>>>> thread resumed.
>>>
>>> The _suspend_threads may create a circular suspend, so we resume them
>>> just in case.
>>>
>>>>
>>>>> All handshakes test passed 100 iteration on each platform.
>>>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go
>>>>> in first.
>>>>
>>>> Okay. Just waiting for Dan's stress test results.
>>>
>>> Please see v6:
>>> Inc: http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>>> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>>>
>>> Thanks, Robbin
>>>
>>>
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>
>>>>> v5 full:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>>>> v5 inc:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>>>> Check-out v3, just full:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>> L2954: assert((!has_last_Java_frame() &&
>>>>>>>> java_call_counter() == 0) ||
>>>>>>>> L2955: (has_last_Java_frame() &&
>>>>>>>> java_call_counter() > 0),
>>>>>>>> "wrong java_sp info!");
>>>>>>>> Perhaps change to this:
>>>>>>>>
>>>>>>>> assert((!has_last_Java_frame() &&
>>>>>>>> java_call_counter() == 0) ||
>>>>>>>> (has_last_Java_frame() &&
>>>>>>>> java_call_counter() > 0),
>>>>>>>> "unexpected frame info: has_last_frame=%d,
>>>>>>>> java_call_counter=%d",
>>>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>>>
>>>>>>>> and consider making the same change to the nmethod
>>>>>>>> version.
>>>>>>>>
>>>>>>>
>>>>>>> Fixed both!
>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>> L449: // Must check for handshakes first, since ops
>>>>>>>> returns.
>>>>>>>> Perhaps:
>>>>>>>> // Check for handshakes first since we may need to
>>>>>>>> return a VMop.
>>>>>>>>
>>>>>>>> L454: // Check for a true cleanup first, trying to keep
>>>>>>>> stats correct.
>>>>>>>> Perhaps:
>>>>>>>> // Check for a cleanup before SafepointALot to
>>>>>>>> keep stats correct.
>>>>>>>>
>>>>>>>> L635: // We want to make sure that we get to a
>>>>>>>> safepoint regularly.
>>>>>>>> Perhaps an addition:
>>>>>>>> // We want to make sure that we get to a
>>>>>>>> safepoint regularly
>>>>>>>> // even when executing VMops that don't require
>>>>>>>> safepoints.
>>>>>>>>
>>>>>>>
>>>>>>> Fixed above!
>>>>>>>
>>>>>>> Publishing v4, since I need a second review:
>>>>>>> Full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>>>> Inc:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>>
>>>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>>>
>>>>>>>> If you decide to tweak the semantics for SafepointALot down the
>>>>>>>> road,
>>>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>>>> another bug.
>>>>>>>>
>>>>>>>> Thumbs up! Your call on whether to tweak the assert or change
>>>>>>>> the comments.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>>>
>>>>>>>>>>> It's too difficult to craft my comments relative to the
>>>>>>>>>>> incremental
>>>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>>> No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>>> No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>>> L2953: // This checks that the thread have a correct
>>>>>>>>>>> frame state
>>>>>>>>>>> during a handshake
>>>>>>>>>>> typo: s/have/has/
>>>>>>>>>>> Also please add a '.' to end of the sentence.
>>>>>>>>>>
>>>>>>>>>> Fixed.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> L2954: assert((!has_last_Java_frame() &&
>>>>>>>>>>> java_call_counter() == 0) ||
>>>>>>>>>>> L2955: (has_last_Java_frame() &&
>>>>>>>>>>> java_call_counter() > 0),
>>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>>> Trying to make sure I understand what you are
>>>>>>>>>>> asserting:
>>>>>>>>>>>
>>>>>>>>>>> 1) if the thread does not have a last Java frame
>>>>>>>>>>> then it must have
>>>>>>>>>>> never called Java
>>>>>>>>>>> 2) if the thread has a last Java frame, then it
>>>>>>>>>>> must have called
>>>>>>>>>>> Java at least once
>>>>>>>>>>>
>>>>>>>>>>> I'm good with the second expression. I have some
>>>>>>>>>>> vague doubts about
>>>>>>>>>>> first expression. When a JavaThread is done
>>>>>>>>>>> executing Java code and
>>>>>>>>>>> it is on its way toward a thread exit, is there
>>>>>>>>>>> ever a time when it
>>>>>>>>>>> no longer has a last Java frame? I'm thinking a
>>>>>>>>>>> handshake late in
>>>>>>>>>>> the JavaThread's life...
>>>>>>>>>>
>>>>>>>>>> The assert is a copy paste from
>>>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>>>
>>>>>>>>>> I have seen this assert trigger twice on windows, either it's
>>>>>>>>>> wrong and should
>>>>>>>>>> be removed from nmethods_do, or it is correct and this will
>>>>>>>>>> help us find when
>>>>>>>>>> this happens.
>>>>>>>>>
>>>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>>>
>>>>>>>>> assert((!has_last_Java_frame() && java_call_counter() ==
>>>>>>>>> 0) ||
>>>>>>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>> "unexpected frame info: has_last_frame=%d,
>>>>>>>>> java_call_counter=%d",
>>>>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>>> No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>>> L503: MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>>> Please add something like this above L503:
>>>>>>>>>>>
>>>>>>>>>>> // Have to unlock VMOperationQueue_lock
>>>>>>>>>>> just in case
>>>>>>>>>>> // no_op_safepoint() has to do a
>>>>>>>>>>> handshake.
>>>>>>>>>>>
>>>>>>>>>>> old L619: // We want to make sure that we get to a
>>>>>>>>>>> safepoint
>>>>>>>>>>> regularly.
>>>>>>>>>>> old L620: //
>>>>>>>>>>> old L621: if ((_cur_vm_operation =
>>>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>>> This call to no_op_safepoint() is at the bottom of
>>>>>>>>>>> the "while(true)"
>>>>>>>>>>> loop in VMThread::loop(). Before the fix for
>>>>>>>>>>> 8219436, this line
>>>>>>>>>>> used to be:
>>>>>>>>>>>
>>>>>>>>>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>>>
>>>>>>>>>>> and it was the only call to
>>>>>>>>>>> no_op_safepoint_needed() that passed
>>>>>>>>>>> 'true'. It's the 'true' parameter that made the
>>>>>>>>>>> comment on L619
>>>>>>>>>>> correct. Why do I say that? Well, the only way we
>>>>>>>>>>> get to the
>>>>>>>>>>> bottom of the "while(true)" loop is if we had a
>>>>>>>>>>> vm_operation to
>>>>>>>>>>> perform. Let's say we had a bunch of no-safepoint
>>>>>>>>>>> vm_operations
>>>>>>>>>>> to perform and we just kept executing them, one
>>>>>>>>>>> after another.
>>>>>>>>>>> While doing this work, if our time between
>>>>>>>>>>> safepoints exceeds
>>>>>>>>>>> GuaranteedSafepointInterval, then this
>>>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>>> call is what would detect that we've gone too long
>>>>>>>>>>> between
>>>>>>>>>>> safepoints
>>>>>>>>>>> and would force one.
>>>>>>>>>>>
>>>>>>>>>>> With the fix for 8219436, that call was changed to
>>>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>>> and with the v2 version of this fix that call is
>>>>>>>>>>> now gone. So we no
>>>>>>>>>>> longer have the ability to have
>>>>>>>>>>> GuaranteedSafepointInterval work
>>>>>>>>>>> right when we are doing lots of non-safepoint VM
>>>>>>>>>>> operations.
>>>>>>>>>>>
>>>>>>>>>>> This no_op_safepoint_needed() probe point, either
>>>>>>>>>>> with 'true' or
>>>>>>>>>>> 'false', also gave us the opportunity to force a
>>>>>>>>>>> safepoint when
>>>>>>>>>>> SafepointALot is true or when
>>>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>>> returns true. In the v2 version of this fix, we
>>>>>>>>>>> lose the ability
>>>>>>>>>>> to SafepointALot after each VM operation. We also
>>>>>>>>>>> lose the ability
>>>>>>>>>>> to do cleanup() after each VM operation.
>>>>>>>>>>
>>>>>>>>>> And I don't think that is an issue, ICache and Monitor, which
>>>>>>>>>> is the
>>>>>>>>>> cleanup we
>>>>>>>>>> do, just have just basic heuristic. If we cleanup every second
>>>>>>>>>> or every other
>>>>>>>>>> second makes little difference.
>>>>>>>>>
>>>>>>>>> My point is that you are changing the semantics of
>>>>>>>>> SafepointALot without
>>>>>>>>> stating clearly that is what you are doing. The point of
>>>>>>>>> SafepointALot is
>>>>>>>>> to inject a safepoint after every VMop even those VMops that
>>>>>>>>> have to be
>>>>>>>>> executed at a safepoint. SafepointALot is considered a stress
>>>>>>>>> option for
>>>>>>>>> a very good reason.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> SafepointALot is just diag, which causes safepoint directly
>>>>>>>>>> after a safepoint.
>>>>>>>>>> I'd rather have it doing one extra safepoint per
>>>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>>>> that way you know how many safepoint you have per second,
>>>>>>>>>> which is really
>>>>>>>>>> useful
>>>>>>>>>> for benchmarking. In current form it can't be used for
>>>>>>>>>> benchmarks.
>>>>>>>>>
>>>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>>>> stress option.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think you need to restore the "bool check_time"
>>>>>>>>>>> parameter to
>>>>>>>>>>> no_op_safepoint() and you need to restore this call
>>>>>>>>>>> to as:
>>>>>>>>>>> VMThread::no_op_safepoint_needed(true) along with
>>>>>>>>>>> the rest
>>>>>>>>>>> of old L618-L626.
>>>>>>>>>>
>>>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>>>>
>>>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you
>>>>>>>>> can miss
>>>>>>>>> a time check.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> So the difference is that I think the previous behavior was
>>>>>>>>>> bad/buggy and
>>>>>>>>>> you are trying to restore it :)
>>>>>>>>>>
>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>
>>>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>>>> - if timeout
>>>>>>>>>> - do handshake a lot
>>>>>>>>>> - we are guaranteed to pass the time check, so cleanup if
>>>>>>>>>> needed
>>>>>>>>>> - else safepoint alot
>>>>>>>>>> - if op
>>>>>>>>>> - after op
>>>>>>>>>> - do handshake a lot
>>>>>>>>>> - check time, if expired, cleanup if needed
>>>>>>>>>> - else safepoint alot
>>>>>>>>>>
>>>>>>>>>> I think that is what you want :)
>>>>>>>>>> And that's what in v3.
>>>>>>>>>>
>>>>>>>>>> I would rather have:
>>>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>>>> - if wait-time > 0
>>>>>>>>>> - wait on queue
>>>>>>>>>> - else or timeout
>>>>>>>>>> - do handshake a lot
>>>>>>>>>> - cleanup if needed
>>>>>>>>>> - else safepoint alot
>>>>>>>>>>
>>>>>>>>>> That way we do check for cleanup after non-safepoint in time
>>>>>>>>>> but we do not
>>>>>>>>>> causes extra safepoints after safepoints with SafepointALot.
>>>>>>>>>> (I know you
>>>>>>>>>> think of this as feature :) )
>>>>>>>>>
>>>>>>>>> Not so much a feature as a stress option. I love my stress
>>>>>>>>> options.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks, Robbin
>>>>>>>>>
>>>>>>>>
More information about the hotspot-runtime-dev
mailing list