RFR: 8266557: assert(SafepointMechanism::local_poll_armed(_handshakee)) failed: Must be

Mon May 24 10:25:01 UTC 2021

On Mon, 24 May 2021 04:33:18 GMT, David Holmes <dholmes at openjdk.org> wrote:

>> Please consider this change-set which address the issue on hand.
>> 
>> I identified two problems:
>> 
>> - is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
>> Fixed by leaving the handshakee being processed in queue until completed.
>> And remove looping, since if ever the queue is empty the handshakee may processed.
>> If ever want to loop again, we must make sure queue is not empty before removing the processed handshake.
>> But there is, at this moment, no performance benefit to that, so I chosse the simple, easy to reason about version. (some crazy stress test can see a difference)
>> 
>> Note that I'll do a follow-up and make is_locked() ifdef ASSERT only.
>> 
>> - have_non_self_executable_operation() only provide correct acquire if first in queue matched, if second item matched it could be re-orderd with reading the poll state.
>> Fixed by adding a loadload.
>> 
>> I could at first reproduce by checking _active_handshaker in update_poll (~1/50) and an increase in the test time by ten.
>> (real reprod ~1/400 with increased test time)
>> If we are updating the poll there should not be an active handshaker.
>> The above fixed the issue.
>> But after a rebase when I was trying to pin point the issue I could no longer reproduce even without any changes.
>> 
>> Added Atomic store/load to _active_handshaker since it may be concurrently loaded (it may only be used to ask if current thread is active handshaker).
>> 
>> Passes stressing relevant test on aarch64 and t1-7.
>> 
>> Thanks, Robbin
>
> src/hotspot/share/runtime/handshake.cpp line 473:
> 
>> 471:   while (has_operation()) {
>> 472:     MutexLocker ml(&_lock, Mutex::_no_safepoint_check_flag);
>> 473:     HandshakeOperation* op = get_op_for_self();
> 
> I still don't understand why we now do a peek here, and then do the pop below (in remove-op), instead of just doing the pop here?

The reason is, from above:
>When poll is armed the handshakee must go to slow path and lock the HandshakeState lock if:
>- 1: There is handshake operation to be processed
>- 2: Some-else is in the middle of processing a handshake operation for this handshakee

This, in practice, means the first pointer of the queue most be non-NULL, and seen as non-NULL, until after we have processed the handshake operation.
By doing:
1: get op
2: do op
  - When handshakee first checks poll, it is armed.
  - Then it checks global poll (safepoint) and the queue state.
  -  If there is a handshake operation to be processed goto slow path.
3: remove op
  - First ptr can be set NULL, which means the handshakee may elide the slow path.
  - The handshakee sees poll armed, but no global poll and nothing in queue.
  - The handshakee may disarm the poll.

By simply removing the OP after we have executed we know the handshakee must go to slow path.

> src/hotspot/share/runtime/handshake.cpp line 570:
> 
>> 568:   Thread* current_thread = Thread::current();
>> 569: 
>> 570:   HandshakeOperation* op = get_op();
> 
> Again why not pop here?

See above.

If there was only one OP in queue, the handshake may now disarm it's poll and continue.
If there is more in queue it will still go to slow path.

-------------

PR: https://git.openjdk.java.net/jdk/pull/3973