RFR(s): 8220774: Add HandshakeALot diag option
Robbin Ehn
robbin.ehn at oracle.com
Fri Mar 22 15:29:34 UTC 2019
Hi David,
On 3/22/19 8:05 AM, David Holmes wrote:
> Hi Robbin,
>
> This was a little more complex than I had imagined. :) A couple of comments:
>
> src/hotspot/share/runtime/vmThread.cpp
>
> 508   // Have to unlock VMOperationQueue_lock just in case no_op_safepoint()
> 509   // has to do a handshake.
> 510   MutexUnlockerEx mul(VMOperationQueue_lock, Mutex::_no_safepoint_check_flag);
> 511   if (timedout && (_cur_vm_operation = VMThread::no_op_safepoint()) != NULL) {
>
> wouldn't it be better to check timedout first, and only then use the unlocker
> and check _cur_vm_operation?
Fixed.
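For clarity, the reordered code now has roughly this shape (just a sketch of the
structure, the exact lines are in the v5 webrev):

  if (timedout) {
    // Have to unlock VMOperationQueue_lock just in case no_op_safepoint()
    // has to do a handshake.
    MutexUnlockerEx mul(VMOperationQueue_lock, Mutex::_no_safepoint_check_flag);
    _cur_vm_operation = VMThread::no_op_safepoint();
  }
  if (_cur_vm_operation != NULL) {
    // proceed with the no-op safepoint as before
  }

So the unlocker is only entered when we actually timed out, and _cur_vm_operation
is checked with the lock held again.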
>
> ---
>
> src/hotspot/share/runtime/thread.cpp
>
> Can has_last_Java_frame() and java_call_counter() change values between their
> use in the assert and their use in the assert message?
No, the JavaThread is safepoint_safe here, so it must have a stable Java stack.
>
> ---
>
> Don't you want to add some tests that exercise this? Or update existing tests
> that use SafepointALot to also use HandshakeALot?
Fixed!
All handshake tests passed 100 iterations on each platform.
But the fix for "8221207: Redo JDK-8218446 - SuspendAtExit hangs" should go in first.
v5 full:
http://cr.openjdk.java.net/~rehn/8220774/v5/full/webrev/
v5 inc:
http://cr.openjdk.java.net/~rehn/8220774/v5/inc/webrev/
Thanks, Robbin
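
PS: In case it is useful for the second review, my mental model of
no_op_safepoint() after these changes is roughly the sketch below. The control
flow and the comments follow the discussion further down in this thread;
handshake_alot() and the two op variables are only placeholders, not the actual
webrev code.

  VM_Operation* VMThread::no_op_safepoint() {
    // Check for handshakes first since we may need to return a VMop.
    if (HandshakeALot) {
      handshake_alot();  // placeholder for the handshake-all-threads step
    }
    // We want to make sure that we get to a safepoint regularly
    // even when executing VMops that don't require safepoints.
    jlong interval_ms = SafepointTracing::time_since_last_safepoint_ms();
    if (GuaranteedSafepointInterval == 0 ||
        interval_ms < (jlong)GuaranteedSafepointInterval) {
      return NULL;  // not time for a forced safepoint yet
    }
    // Check for a cleanup before SafepointALot to keep stats correct.
    if (SafepointSynchronize::is_cleanup_needed()) {
      return &cleanup_op;          // placeholder for a cleanup VM op
    }
    if (SafepointALot) {
      return &safepoint_alot_op;   // placeholder for a SafepointALot VM op
    }
    return NULL;
  }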
>
> Thanks,
> David
>
> On 22/03/2019 2:04 am, Robbin Ehn wrote:
>> On 2019-03-21 16:11, Daniel D. Daugherty wrote:
>>>> Check-out v3, just full:
>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>
>>> src/hotspot/share/runtime/globals.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/thread.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/thread.cpp
>>> L2954:     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>> L2955:            (has_last_Java_frame() && java_call_counter() > 0),
>>>                   "wrong java_sp info!");
>>>     Perhaps change to this:
>>>
>>>       assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>              (has_last_Java_frame() && java_call_counter() > 0),
>>>              "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>              has_last_Java_frame(), java_call_counter());
>>>
>>> and consider making the same change to the nmethod version.
>>>
>>
>> Fixed both!
>>
>>> src/hotspot/share/runtime/vmThread.hpp
>>> No comments.
>>>
>>> src/hotspot/share/runtime/vmThread.cpp
>>> L449: // Must check for handshakes first, since ops returns.
>>> Perhaps:
>>> // Check for handshakes first since we may need to return a VMop.
>>>
>>> L454: // Check for a true cleanup first, trying to keep stats correct.
>>> Perhaps:
>>> // Check for a cleanup before SafepointALot to keep stats correct.
>>>
>>> L635: // We want to make sure that we get to a safepoint regularly.
>>> Perhaps an addition:
>>> // We want to make sure that we get to a safepoint regularly
>>> // even when executing VMops that don't require safepoints.
>>>
>>
>> Fixed above!
>>
>> Publishing v4, since I need a second review:
>> Full:
>> http://cr.openjdk.java.net/~rehn/8220774/v4/webrev/
>> Inc:
>> http://cr.openjdk.java.net/~rehn/8220774/v4/inc/webrev/
>>
>> Thanks, Robbin
>>
>>> Okay, I now grok why VMThread::no_op_safepoint() does not need a
>>> 'check_time' parameter. Thanks for restoring the no_op_safepoint()
>>> call at the bottom of the "while(true)" in VMThread::loop().
>>>
>>> If you decide to tweak the semantics for SafepointALot down the road,
>>> it would be best to do that in its own bug rather than as part of
>>> another bug.
>>>
>>> Thumbs up! Your call on whether to tweak the assert or change the comments.
>>>
>>> Dan
>>>
>>>
>>>
>>> On 3/21/19 10:28 AM, Daniel D. Daugherty wrote:
>>>> On 3/21/19 9:57 AM, Robbin Ehn wrote:
>>>>> Hi Dan,
>>>>>
>>>>> On 2019-03-20 21:21, Daniel D. Daugherty wrote:
>>>>>>> And here is v2 for you to consider:
>>>>>>> http://rehn-ws.se.oracle.com/cr_mirror/8220774/v2/webrev/
>>>>>>
>>>>>> It's too difficult to craft my comments relative to the incremental
>>>>>> webrev so I'm working with the full webrev still.
>>>>>>
>>>>>>
>>>>>> src/hotspot/share/runtime/globals.hpp
>>>>>> No comments.
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.hpp
>>>>>> No comments.
>>>>>>
>>>>>> src/hotspot/share/runtime/thread.cpp
>>>>>> L2953: // This checks that the thread have a correct frame state during a handshake
>>>>>>     typo: s/have/has/
>>>>>>     Also please add a '.' to the end of the sentence.
>>>>>
>>>>> Fixed.
>>>>>
>>>>>>
>>>>>> L2954:     assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>>>> L2955:            (has_last_Java_frame() && java_call_counter() > 0),
>>>>>>                   "wrong java_sp info!");
>>>>>> Trying to make sure I understand what you are asserting:
>>>>>>
>>>>>> 1) if the thread does not have a last Java frame then it must have
>>>>>> never called Java
>>>>>> 2) if the thread has a last Java frame, then it must have called
>>>>>> Java at least once
>>>>>>
>>>>>> I'm good with the second expression. I have some vague doubts about the
>>>>>> first expression. When a JavaThread is done executing Java code and
>>>>>> it is on its way toward a thread exit, is there ever a time when it
>>>>>> no longer has a last Java frame? I'm thinking a handshake late in
>>>>>> the JavaThread's life...
>>>>>
>>>>> The assert is a copy-paste from:
>>>>> L2959 void JavaThread::nmethods_do(CodeBlobClosure* cf) {
>>>>>
>>>>> I have seen this assert trigger twice on Windows, so either it's wrong and should
>>>>> be removed from nmethods_do, or it is correct and this will help us find out when
>>>>> this happens.
>>>>
>>>> Discussed this with Robbin via chat. I'm good with leaving
>>>> the assert in place. Perhaps tweak both like this:
>>>>
>>>> assert((!has_last_Java_frame() && java_call_counter() == 0) ||
>>>> (has_last_Java_frame() && java_call_counter() > 0),
>>>> "unexpected frame info: has_last_frame=%d, java_call_counter=%d",
>>>> has_last_Java_frame(), java_call_counter());
>>>>
>>>>>
>>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.hpp
>>>>>> No comments.
>>>>>>
>>>>>> src/hotspot/share/runtime/vmThread.cpp
>>>>>> L503: MutexUnlockerEx mul(VMOperationQueue_lock, Mutex::_no_safepoint_check_flag);
>>>>>> Please add something like this above L503:
>>>>>>
>>>>>> // Have to unlock VMOperationQueue_lock just in case
>>>>>> // no_op_safepoint() has to do a handshake.
>>>>>>
>>>>>> old L619: // We want to make sure that we get to a safepoint regularly.
>>>>>> old L620: //
>>>>>> old L621: if ((_cur_vm_operation = VMThread::no_op_safepoint(false)) != NULL) {
>>>>>> This call to no_op_safepoint() is at the bottom of the "while(true)"
>>>>>> loop in VMThread::loop(). Before the fix for 8219436, this line
>>>>>> used to be:
>>>>>>
>>>>>> if (VMThread::no_op_safepoint_needed(true)) {
>>>>>>
>>>>>> and it was the only call to no_op_safepoint_needed() that passed
>>>>>> 'true'. It's the 'true' parameter that made the comment on L619
>>>>>> correct. Why do I say that? Well, the only way we get to the
>>>>>> bottom of the "while(true)" loop is if we had a vm_operation to
>>>>>> perform. Let's say we had a bunch of no-safepoint vm_operations
>>>>>> to perform and we just kept executing them, one after another.
>>>>>> While doing this work, if our time between safepoints exceeds
>>>>>> GuaranteedSafepointInterval, then this no_op_safepoint_needed(true)
>>>>>> call is what would detect that we've gone too long between safepoints
>>>>>> and would force one.
>>>>>>
>>>>>> With the fix for 8219436, that call was changed to
>>>>>> no_op_safepoint(false)
>>>>>> and with the v2 version of this fix that call is now gone. So we no
>>>>>> longer have the ability to have GuaranteedSafepointInterval work
>>>>>> right when we are doing lots of non-safepoint VM operations.
>>>>>>
>>>>>> This no_op_safepoint_needed() probe point, either with 'true' or
>>>>>> 'false', also gave us the opportunity to force a safepoint when
>>>>>> SafepointALot is true or when
>>>>>> SafepointSynchronize::is_cleanup_needed()
>>>>>> returns true. In the v2 version of this fix, we lose the ability
>>>>>> to SafepointALot after each VM operation. We also lose the ability
>>>>>> to do cleanup() after each VM operation.
>>>>>
>>>>> And I don't think that is an issue. ICache and Monitor, which is the
>>>>> cleanup we do, just have basic heuristics. Whether we clean up every
>>>>> second or every other second makes little difference.
>>>>
>>>> My point is that you are changing the semantics of SafepointALot without
>>>> stating clearly that is what you are doing. The point of SafepointALot is
>>>> to inject a safepoint after every VMop even those VMops that have to be
>>>> executed at a safepoint. SafepointALot is considered a stress option for
>>>> a very good reason.
>>>>
>>>>
>>>>> SafepointALot is just a diag option, and it currently causes a safepoint
>>>>> directly after a safepoint. I'd rather have it do one extra safepoint per
>>>>> GuaranteedSafepointInterval; that way you know how many safepoints you have
>>>>> per second, which is really useful for benchmarking. In its current form it
>>>>> can't be used for benchmarks.
>>>>
>>>> You would never use SafepointALot in a benchmark run. It is a
>>>> stress option.
>>>>
>>>>>
>>>>>>
>>>>>> I think you need to restore the "bool check_time" parameter to
>>>>>> no_op_safepoint() and you need to restore this call as:
>>>>>> VMThread::no_op_safepoint_needed(true) along with the rest
>>>>>> of old L618-L626.
>>>>>
>>>>> I don't need the parameter; we can always check the time.
>>>>> When we time out on the queue wait, we will always pass the time check.
>>>>
>>>> As I pointed out, if you are doing non-safepoint VM ops, you can miss
>>>> a time check.
>>>>
>>>>
>>>>> So the difference is that I think the previous behavior was bad/buggy and
>>>>> you are trying to restore it :)
>>>>>
>>>>> Check-out v3, just full:
>>>>> http://cr.openjdk.java.net/~rehn/8220774/v3/webrev/
>>>>
>>>> I'll post a second message with my re-review for this webrev.
>>>>
>>>>
>>>>>
>>>>> - wait on queue GuaranteedSafepointInterval
>>>>> - if timeout
>>>>>   - do handshake a lot
>>>>>   - we are guaranteed to pass the time check, so cleanup if needed
>>>>>   - else safepoint alot
>>>>> - if op
>>>>>   - after op
>>>>>     - do handshake a lot
>>>>>     - check time, if expired, cleanup if needed
>>>>>     - else safepoint alot
>>>>>
>>>>> I think that is what you want :)
>>>>> And that's what's in v3.
>>>>>
>>>>> I would rather have:
>>>>> - wait-time = GuaranteedSafepointInterval - SafepointTracing::time_since_last_safepoint_ms()
>>>>> - if wait-time > 0
>>>>>   - wait on queue
>>>>> - else or timeout
>>>>>   - do handshake a lot
>>>>>   - cleanup if needed
>>>>>   - else safepoint alot
>>>>>
>>>>> That way we do check for cleanup in time after non-safepoint ops, but we do
>>>>> not cause extra safepoints after safepoints with SafepointALot. (I know you
>>>>> think of this as a feature :) )
>>>>
>>>> Not so much a feature as a stress option. I love my stress options.
>>>>
>>>> Dan
>>>>
>>>>>
>>>>> Thanks, Robbin
>>>>
>>>