RFR: 8266557: assert(SafepointMechanism::local_poll_armed(_handshakee)) failed: Must be

Mon May 24 12:44:02 UTC 2021

On 24/05/2021 8:25 pm, Robbin Ehn wrote:
> On Mon, 24 May 2021 04:33:18 GMT, David Holmes <dholmes at openjdk.org> wrote:
> 
>>> Please consider this change-set which address the issue on hand.
>>>
>>> I identified two problems:
>>>
>>> - is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
>>> Fixed by leaving the handshakee being processed in queue until completed.
>>> And remove looping, since if ever the queue is empty the handshakee may processed.
>>> If ever want to loop again, we must make sure queue is not empty before removing the processed handshake.
>>> But there is, at this moment, no performance benefit to that, so I chosse the simple, easy to reason about version. (some crazy stress test can see a difference)
>>>
>>> Note that I'll do a follow-up and make is_locked() ifdef ASSERT only.
>>>
>>> - have_non_self_executable_operation() only provide correct acquire if first in queue matched, if second item matched it could be re-orderd with reading the poll state.
>>> Fixed by adding a loadload.
>>>
>>> I could at first reproduce by checking _active_handshaker in update_poll (~1/50) and an increase in the test time by ten.
>>> (real reprod ~1/400 with increased test time)
>>> If we are updating the poll there should not be an active handshaker.
>>> The above fixed the issue.
>>> But after a rebase when I was trying to pin point the issue I could no longer reproduce even without any changes.
>>>
>>> Added Atomic store/load to _active_handshaker since it may be concurrently loaded (it may only be used to ask if current thread is active handshaker).
>>>
>>> Passes stressing relevant test on aarch64 and t1-7.
>>>
>>> Thanks, Robbin
>>
>> src/hotspot/share/runtime/handshake.cpp line 473:
>>
>>> 471:   while (has_operation()) {
>>> 472:     MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag);
>>> 473:     HandshakeOperation* op = get_op_for_self();
>>
>> I still don't understand why we now do a peek here, and then do the pop below (in remove-op), instead of just doing the pop here?
> 
> The reason is, from above:
>> When poll is armed the handshakee must go to slow path and lock the HandshakeState lock if:
>> - 1: There is handshake operation to be processed
>> - 2: Some-else is in the middle of processing a handshake operation for this handshakee
> 
> This, in practice, means the first pointer of the queue most be non-NULL, and seen as non-NULL, until after we have processed the handshake operation.
> By doing:
> 1: get op
> 2: do op
>    - When handshakee first checks poll, it is armed.
>    - Then it checks global poll (safepoint) and the queue state.
>    -  If there is a handshake operation to be processed goto slow path.
> 3: remove op
>    - First ptr can be set NULL, which means the handshakee may elide the slow path.
>    - The handshakee sees poll armed, but no global poll and nothing in queue.
>    - The handshakee may disarm the poll.
> 
> By simply removing the OP after we have executed we know the handshakee must go to slow path.

What are you calling the "slow path" here?

I think I need a refresher course on the basics of the handshake 
mechanism as I am not understanding this at all sorry.

David
-----

>> src/hotspot/share/runtime/handshake.cpp line 570:
>>
>>> 568:   Thread* current_thread = Thread::current();
>>> 569:
>>> 570:   HandshakeOperation* op = get_op();
>>
>> Again why not pop here?
> 
> See above.
> 
> If there was only one OP in queue, the handshake may now disarm it's poll and continue.
> If there is more in queue it will still go to slow path.
> 
> -------------
> 
> PR: https://git.openjdk.java.net/jdk/pull/3973
>