RFR(s): 8220774: Add HandshakeALot diag option
Robbin Ehn
robbin.ehn at oracle.com
Tue Mar 26 07:00:00 UTC 2019
Hi David,
> This really makes me wonder whether the test truly stresses suspend/resume or
> whether it quickly grinds to a halt till the main thread tells everyone to exit?
That is the reason for skipping two 'last' threads.
In worse case there is always two threads suspending and resuming the others.
We seem to manage something like 800 handshakes during the test on my machine.
So the chances are not to bad to hit a few of the interesting code paths.
Adding Dan's last comment suggestion, are you fine with v6 + that comment?
Thanks, Robbin
>
> Cheers,
> David
>
>> Thumbs up!
>>
>> Dan
>>
>>
>>
>> On 3/25/19 5:09 AM, Robbin Ehn wrote:
>>> Hi David,
>>>
>>> On 3/25/19 2:36 AM, David Holmes wrote:
>>>> Hi Robbin,
>>>>
>>>> On 23/03/2019 1:29 am, Robbin Ehn wrote:
>>>>> Hi David,
>>>>>
>>>>> On 3/22/19 8:05 AM, David Holmes wrote:
>>>>>> Hi Robbin,
>>>>>>
>>>>>> This was a little more complex than I had imagined. :) A couple of comments:
>>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>
>>>>>> 508 // Have to unlock VMOperationQueue_lock just in case
>>>>>> no_op_safepoint()
>>>>>> 509 // has to do a handshake.
>>>>>> 510 MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>> 511 if (timedout && (_cur_vm_operation =
>>>>>> VMThread::no_op_safepoint()) != NULL) {
>>>>>>
>>>>>> wouldn't it be better to check timedout first and only then use the unlocker
>>>>>> then check _cur_vm_operation?
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>
>>>>>> Can has_last_Java_frame() and java_call_counter() change values between their
>>>>>> use in the assert and their use in the assert message?
>>>>>
>>>>> No, the JavaThread is safepoint_safe here, thus must have a stable java stack.
>>>>
>>>> Ok.
>>>>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> Don't you want to add some tests that exercise this? Or update existing tests
>>>>>> that use SafepointALot to also use HandshakeALot?
>>>>>
>>>>> Fixed!
>>>>
>>>> I'm unclear how this fix and the renamed test relate to the existing bug
>>>> JDK-8214174 that was referenced in the ProblemList entry ??
>>>
>>> Since we are not using withebox to walk thread stack, instead using
>>> HandshakeALot I removed Walk from name. The original could cause a circular
>>> suspend issue.
>>>
>>>>
>>>> test/hotspot/jtreg/runtime/handshake/HandshakeSuspendExitTest.java
>>>>
>>>> If you need the loop index use the full form of the for-loop. This:
>>>>
>>>> 43 int i = 0;
>>>> 44 for (Thread thr : _suspend_threads) {
>>>> 45 if (i++ > _suspend_threads.length -2) {
>>>> 46 // Leave last 2 threads running.
>>>> 47 break;
>>>> 48 }
>>>> 49 if (Thread.currentThread() != thr) {
>>>>
>>>> can reduce to simply:
>>>>
>>>> // Leave last 2 threads running.
>>>> for (int i = 0; i < _suspend_threads.length - 2; i++) {
>>>> if (Thread.currentThread() != _suspend_threads[i]) {
>>>>
>>>
>>> Fixed.
>>>
>>>>
>>>> 67 // Wait for all suspend thread starting to loop.
>>>>
>>>> -> // Wait for all suspend-threads to start looping.
>>>>
>>>> 75 exit_threads[i] = new Thread(new Runnable() { public void
>>>> run() {} });
>>>>
>>>> The "new Runnable()..." is unnecessary - "new Thread();" suffices.
>>>>
>>>>
>>>> 79 // Try to suspend them.
>>>> 80 for (Thread thr : exit_threads) {
>>>> 81 thr.suspend();
>>>> 82 }
>>>> 83 for (Thread thr : exit_threads) {
>>>> 84 thr.resume();
>>>> 85 }
>>>>
>>>> there's really no guarantee of getting the suspend during "exit". The
>>>> SuspendAtExit test seems to use logging to make it more likely to encounter
>>>> the suspend when desired. Is there a reason not to add a second @run to that
>>>> test to use HandShakes?
>>>
>>> The original issue I had was so unlikely that I needed tenths of thousands of
>>> thread and looping that for several hours. (that issue is fixed)
>>> So the test just try to quickly with a small chance hit interesting paths.
>>>
>>> We can do that also.
>>>
>>>>
>>>> 90 do {
>>>> 91 for (Thread thr : _suspend_threads) {
>>>> 92 thr.resume();
>>>> 93 }
>>>> 94 while (_sem.tryAcquire()) {
>>>> 95 --waiting;
>>>> 96 }
>>>> 97 } while (waiting > 0);
>>>>
>>>> why do a busy-wait and resume threads you never suspended? Why not just do a
>>>> blocking acquire() on the semaphore? The main suspend/resume logic in the
>>>> _suspend_threads must leave the target thread resumed.
>>>
>>> The _suspend_threads may create a circular suspend, so we resume them just in
>>> case.
>>>
>>>>
>>>>> All handshakes test passed 100 iteration on each platform.
>>>>> But the "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go in first.
>>>>
>>>> Okay. Just waiting for Dan's stress test results.
>>>
>>> Please see v6:
>>> Inc: http://cr.openjdk.java.net/~rehn/8220774/v6/inc/
>>> Full: http://cr.openjdk.java.net/~rehn/8220774/v6/full/
>>>
>>> Thanks, Robbin
>>>
>>>
>>>>
>>>> Thanks,
>>>> David
>>>> -----
>>>>
>>>>
>>>>> v5 full:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
>>>>> v5 inc:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
>>>>>
>>>>> Thanks, Robbin
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>>>>>>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>>>>>>> Check-out v3, just full:
>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>> L2954: assert((!has_last_Java_frame() && java_call_counter() ==
>>>>>>>> 0) ||
>>>>>>>> L2955: (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>> "wrong java_sp info!");
>>>>>>>> Perhaps change to this:
>>>>>>>>
>>>>>>>> assert((!has_last_Java_frame() && java_call_counter() ==
>>>>>>>> 0) ||
>>>>>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>> "unexpected frame info: has_last_frame=%d,
>>>>>>>> java_call_counter=%d",
>>>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>>>
>>>>>>>> and consider making the same change to the nmethod version.
>>>>>>>>
>>>>>>>
>>>>>>> Fixed both!
>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>> No comments.
>>>>>>>>
>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>> L449: // Must check for handshakes first, since ops returns.
>>>>>>>> Perhaps:
>>>>>>>> // Check for handshakes first since we may need to return
>>>>>>>> a VMop.
>>>>>>>>
>>>>>>>> L454: // Check for a true cleanup first, trying to keep stats
>>>>>>>> correct.
>>>>>>>> Perhaps:
>>>>>>>> // Check for a cleanup before SafepointALot to keep stats
>>>>>>>> correct.
>>>>>>>>
>>>>>>>> L635: // We want to make sure that we get to a safepoint
>>>>>>>> regularly.
>>>>>>>> Perhaps an addition:
>>>>>>>> // We want to make sure that we get to a safepoint
>>>>>>>> regularly
>>>>>>>> // even when executing VMops that don't require safepoints.
>>>>>>>>
>>>>>>>
>>>>>>> Fixed above!
>>>>>>>
>>>>>>> Publishing v4, since I need a second review:
>>>>>>> Full:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>>>>>>> Inc:
>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>>>>>>
>>>>>>> Thanks, Robbin
>>>>>>>
>>>>>>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>>>>>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>>>>>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>>>>>>
>>>>>>>> If you decide to tweak the semantics for SafepointALot down the road,
>>>>>>>> it would be best to do that in its own bug rather than as part of
>>>>>>>> another bug.
>>>>>>>>
>>>>>>>> Thumbs up! Your call on whether to tweak the assert or change the comments.
>>>>>>>>
>>>>>>>> Dan
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>>>>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>>>>>>> Hi Dan,
>>>>>>>>>>
>>>>>>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>>>>>>> And here is v2 for you to consider:
>>>>>>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>>>>>>
>>>>>>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>>>>>>> webrev so I'm working with the full webrev still.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>>>>>>> No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>>>>>>> No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>>>>>>> L2953: // This checks that the thread have a correct frame state
>>>>>>>>>>> during a handshake
>>>>>>>>>>> typo: s/have/has/
>>>>>>>>>>> Also please add a '.' to end of the sentence.
>>>>>>>>>>
>>>>>>>>>> Fixed.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> L2954: assert((!has_last_Java_frame() && java_call_counter()
>>>>>>>>>>> == 0) ||
>>>>>>>>>>> L2955: (has_last_Java_frame() && java_call_counter() >
>>>>>>>>>>> 0),
>>>>>>>>>>> "wrong java_sp info!");
>>>>>>>>>>> Trying to make sure I understand what you are asserting:
>>>>>>>>>>>
>>>>>>>>>>> 1) if the thread does not have a last Java frame then it
>>>>>>>>>>> must have
>>>>>>>>>>> never called Java
>>>>>>>>>>> 2) if the thread has a last Java frame, then it must have
>>>>>>>>>>> called
>>>>>>>>>>> Java at least once
>>>>>>>>>>>
>>>>>>>>>>> I'm good with the second expression. I have some vague
>>>>>>>>>>> doubts about
>>>>>>>>>>> first expression. When a JavaThread is done executing Java
>>>>>>>>>>> code and
>>>>>>>>>>> it is on its way toward a thread exit, is there ever a time
>>>>>>>>>>> when it
>>>>>>>>>>> no longer has a last Java frame? I'm thinking a handshake
>>>>>>>>>>> late in
>>>>>>>>>>> the JavaThread's life...
>>>>>>>>>>
>>>>>>>>>> The assert is a copy paste from
>>>>>>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>>>>>>
>>>>>>>>>> I have seen this assert trigger twice on windows, either it's wrong
>>>>>>>>>> and should
>>>>>>>>>> be removed from nmethods_do, or it is correct and this will help us
>>>>>>>>>> find when
>>>>>>>>>> this happens.
>>>>>>>>>
>>>>>>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>>>>>>> the assert in place. Perhaps tweak both like this:
>>>>>>>>>
>>>>>>>>> assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>>>> "unexpected frame info: has_last_frame=%d,
>>>>>>>>> java_call_counter=%d",
>>>>>>>>> has_last_Java_frame(), java_call_counter());
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>>>>>>> No comments.
>>>>>>>>>>>
>>>>>>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>>>>>>> L503: MutexUnlockerEx mul(VMOperationQueue_lock,
>>>>>>>>>>> Mutex::_no_safepoint_check_flag);
>>>>>>>>>>> Please add something like this above L503:
>>>>>>>>>>>
>>>>>>>>>>> // Have to unlock VMOperationQueue_lock just in
>>>>>>>>>>> case
>>>>>>>>>>> // no_op_safepoint() has to do a handshake.
>>>>>>>>>>>
>>>>>>>>>>> old L619: // We want to make sure that we get to a safepoint
>>>>>>>>>>> regularly.
>>>>>>>>>>> old L620: //
>>>>>>>>>>> old L621: if ((_cur_vm_operation =
>>>>>>>>>>> VMThread::no_op_safepoint(false)) != NULL) {
>>>>>>>>>>> This call to no_op_safepoint() is at the bottom of the
>>>>>>>>>>> "while(true)"
>>>>>>>>>>> loop in VMThread::loop(). Before the fix for 8219436, this
>>>>>>>>>>> line
>>>>>>>>>>> used to be:
>>>>>>>>>>>
>>>>>>>>>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>>>>>>
>>>>>>>>>>> and it was the only call to no_op_safepoint_needed() that
>>>>>>>>>>> passed
>>>>>>>>>>> 'true'. It's the 'true' parameter that made the comment on
>>>>>>>>>>> L619
>>>>>>>>>>> correct. Why do I say that? Well, the only way we get to the
>>>>>>>>>>> bottom of the "while(true)" loop is if we had a
>>>>>>>>>>> vm_operation to
>>>>>>>>>>> perform. Let's say we had a bunch of no-safepoint
>>>>>>>>>>> vm_operations
>>>>>>>>>>> to perform and we just kept executing them, one after another.
>>>>>>>>>>> While doing this work, if our time between safepoints exceeds
>>>>>>>>>>> GuaranteedSafepointInterval, then this
>>>>>>>>>>> no_op_safepoint_needed(true)
>>>>>>>>>>> call is what would detect that we've gone too long between
>>>>>>>>>>> safepoints
>>>>>>>>>>> and would force one.
>>>>>>>>>>>
>>>>>>>>>>> With the fix for 8219436, that call was changed to
>>>>>>>>>>> no_op_safepoint(false)
>>>>>>>>>>> and with the v2 version of this fix that call is now gone.
>>>>>>>>>>> So we no
>>>>>>>>>>> longer have the ability to have GuaranteedSafepointInterval
>>>>>>>>>>> work
>>>>>>>>>>> right when we are doing lots of non-safepoint VM operations.
>>>>>>>>>>>
>>>>>>>>>>> This no_op_safepoint_needed() probe point, either with
>>>>>>>>>>> 'true' or
>>>>>>>>>>> 'false', also gave us the opportunity to force a safepoint
>>>>>>>>>>> when
>>>>>>>>>>> SafepointALot is true or when
>>>>>>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>>>>>>> returns true. In the v2 version of this fix, we lose the
>>>>>>>>>>> ability
>>>>>>>>>>> to SafepointALot after each VM operation. We also lose the
>>>>>>>>>>> ability
>>>>>>>>>>> to do cleanup() after each VM operation.
>>>>>>>>>>
>>>>>>>>>> And I don't think that is an issue, ICache and Monitor, which is the
>>>>>>>>>> cleanup we
>>>>>>>>>> do, just have just basic heuristic. If we cleanup every second or
>>>>>>>>>> every other
>>>>>>>>>> second makes little difference.
>>>>>>>>>
>>>>>>>>> My point is that you are changing the semantics of SafepointALot without
>>>>>>>>> stating clearly that is what you are doing. The point of SafepointALot is
>>>>>>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>>>>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>>>>>>> a very good reason.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> SafepointALot is just diag, which causes safepoint directly after a
>>>>>>>>>> safepoint.
>>>>>>>>>> I'd rather have it doing one extra safepoint per
>>>>>>>>>> GuaranteedSafepointInterval,
>>>>>>>>>> that way you know how many safepoint you have per second, which is really
>>>>>>>>>> useful
>>>>>>>>>> for benchmarking. In current form it can't be used for benchmarks.
>>>>>>>>>
>>>>>>>>> You would never use SafepointALot in a benchmark run. It is is
>>>>>>>>> stress option.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I think you need to restore the "bool check_time" parameter to
>>>>>>>>>>> no_op_safepoint() and you need to restore this call to as:
>>>>>>>>>>> VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>>>>>>> of old L618-L626.
>>>>>>>>>>
>>>>>>>>>> I don't need the parameter, we can always check the time.
>>>>>>>>>> When we timeout on queue wait we will always pass the time check.
>>>>>>>>>
>>>>>>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>>>>>>> a time check.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> So the difference is that I think the previous behavior was bad/buggy and
>>>>>>>>>> you are trying to restore it :)
>>>>>>>>>>
>>>>>>>>>> Check-out v3, just full:
>>>>>>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>>>>>>
>>>>>>>>> I'll post a second message with my re-review for this webrev.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - wait on queue GuaranteedSafepointInterval
>>>>>>>>>> - if timeout
>>>>>>>>>> - do handshake a lot
>>>>>>>>>> - we are guaranteed to pass the time check, so cleanup if needed
>>>>>>>>>> - else safepoint alot
>>>>>>>>>> - if op
>>>>>>>>>> - after op
>>>>>>>>>> - do handshake a lot
>>>>>>>>>> - check time, if expired, cleanup if needed
>>>>>>>>>> - else safepoint alot
>>>>>>>>>>
>>>>>>>>>> I think that is what you want :)
>>>>>>>>>> And that's what in v3.
>>>>>>>>>>
>>>>>>>>>> I would rather have:
>>>>>>>>>> - wait-time = GuaranteedSafepointInterval -
>>>>>>>>>> SafepointTracing::time_since_last_safepoint_ms()
>>>>>>>>>> - if wait-time > 0
>>>>>>>>>> - wait on queue
>>>>>>>>>> - else or timeout
>>>>>>>>>> - do handshake a lot
>>>>>>>>>> - cleanup if needed
>>>>>>>>>> - else safepoint alot
>>>>>>>>>>
>>>>>>>>>> That way we do check for cleanup after non-safepoint in time but we do
>>>>>>>>>> not
>>>>>>>>>> causes extra safepoints after safepoints with SafepointALot. (I know you
>>>>>>>>>> think of this as feature :) )
>>>>>>>>>
>>>>>>>>> Not so much a feature as a stress option. I love my stress options.
>>>>>>>>>
>>>>>>>>> Dan
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks, Robbin
>>>>>>>>>
>>>>>>>>
>>
More information about the hotspot-runtime-dev
mailing list