RFR: 8266557: assert(SafepointMechanism::local_poll_armed(_handshakee)) failed: Must be [v3]

Tue Jun 1 19:30:26 UTC 2021

On Fri, 28 May 2021 07:45:33 GMT, Robbin Ehn <rehn at openjdk.org> wrote:

>> Please consider this change-set which address the issue on hand.
>> 
>> I identified two problems:
>> 
>> - is_locked() uses the _owner field which is unordered (no storestore|storeload) on store-side.
>> Fixed by leaving the handshakee being processed in queue until completed.
>> And remove looping, since if ever the queue is empty the handshakee may processed.
>> If ever want to loop again, we must make sure queue is not empty before removing the processed handshake.
>> But there is, at this moment, no performance benefit to that, so I chosse the simple, easy to reason about version. (some crazy stress test can see a difference)
>> 
>> Note that I'll do a follow-up and make is_locked() ifdef ASSERT only.
>> 
>> - have_non_self_executable_operation() only provide correct acquire if first in queue matched, if second item matched it could be re-orderd with reading the poll state.
>> Fixed by adding a loadload.
>> 
>> I could at first reproduce by checking _active_handshaker in update_poll (~1/50) and an increase in the test time by ten.
>> (real reprod ~1/400 with increased test time)
>> If we are updating the poll there should not be an active handshaker.
>> The above fixed the issue.
>> But after a rebase when I was trying to pin point the issue I could no longer reproduce even without any changes.
>> 
>> Added Atomic store/load to _active_handshaker since it may be concurrently loaded (it may only be used to ask if current thread is active handshaker).
>> 
>> Passes stressing relevant test on aarch64 and t1-7.
>> 
>> Thanks, Robbin
>
> Robbin Ehn has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains four additional commits since the last revision:
> 
>  - Merge branch 'master' into handshakee
>  - Small update
>  - Merge branch 'master' into handshakee
>  - Fix

Hi Robbin,

I think your patch looks good and fixes the issue with is_locked(). 
Based on the root cause it seems another simpler solution would have been to add a fence between claim_handshake() and can_process_handshake() to make sure the target sees the write to _owner before we read a handshake safe state. If the issue is that _owner should be only use for debug builds or that the fence should have been in Mutex class and we don't want to affect all other cases, then we could use _active_handshaker instead of _owner. That way we could also keep the while loop in try_process() to process multiple operations at a time, and don't worry about the peek vs pop issue.

Thanks,
Patricio

src/hotspot/share/runtime/handshake.cpp line 330:

> 328:   // closure are visible to the VMThread/Handshaker after it reads
> 329:   // that the operation has completed.
> 330:   Atomic::dec(&_pending_threads);

Why do we need a full fence instead of a release as before?

-------------

PR: https://git.openjdk.java.net/jdk/pull/3973