RFR: 8244340: Handshake processing thread lacks yielding

Robbin Ehn robbin.ehn at oracle.com
Wed May 6 08:43:59 UTC 2020


Hi David,

On 2020-05-06 00:26, David Holmes wrote:
> Hi Robbin,
> 
> Sorry that I didn't get a chance to look at this earlier.
> 
> On 5/05/2020 11:08 pm, Robbin Ehn wrote:
>> Hi all, please review.
>>
>> As the issue describes there are problems with this infinitely spinning.
>> To avoid any performance regression we do not yield while we do or see
>> progress. But when we enter a period of no progression we 'quickly' give
>> up.
> 
> I must admit I'm having trouble understanding the use of the 
> HandshakeSpinYield utility class versus the original trivial suggestion 
> of adding:
> 
> if (!os::is_MP() && !_op->is_completed()) {
>    os::naked_yield();
> }

This only works on UP; it is possible to trigger the pathological case
on MP as well. When running short benchmarks without warmup (since the
VM does many handshakes early on) I see a speed-up of ~16% on UP and 10%
on two CPUs; on 4 CPUs it is harder to notice any difference.

Also, as I said, when running the specJVM2008 serial benchmark with ZGC
(32 cores), handshake time to completion seems to get 4x lower in that setup.
So this can significantly improve handshake performance.

> 
> I can't quite get my head around the complexity of it.  Can you explain 
> the basic principle of operation in a loop like this:
> 
>   204     HandshakeSpinYield cbo(start_time_ns);
>   205     do {
>   206       if (handshake_has_timed_out(start_time_ns)) {
>   207         handle_timeout();
>   208       }
>   209       HandshakeState::ProcessResult pr = _target->handshake_try_process(_op);
>   210       cbo.add_result(pr);
>   211       cbo.process();
>   212     } while (!_op->is_completed());

The problem this heuristic tries to handle is avoiding a yield when we
are close to finishing, or when we can execute the handshake for the
target and thus get closer to finishing.
This is the single-target case, which is not so interesting, but the
same principle applies.

So if handshake_try_process() fails for a different reason than before,
the thread(s) must have been on proc. That means it is likely still on
proc, but the further we get from the last change in the return value of
handshake_try_process(), the less likely it is that it is still on proc.
Note: in "on proc" above I include things like a huge array copy (which
lacks polls) and thus produces the same result.

Therefore we record the time we last saw a result change and start
yielding when the result has been stale for some time.
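To make the idea concrete, here is a minimal sketch of that staleness
heuristic. This is not the actual HandshakeSpinYield class from the
webrev; the class name, the packed result summary, and the stale limit
are all illustrative assumptions.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Sketch of the staleness heuristic: keep spinning while the per-round
// results keep changing, and only recommend yielding once the results
// have been stable longer than a threshold. (Hypothetical names; the
// real HandshakeSpinYield is more elaborate.)
class SpinYieldSketch {
  using clock = std::chrono::steady_clock;
  clock::time_point _last_change = clock::now();  // last observed change
  uint64_t _prev = 0;                             // previous round summary
  std::chrono::milliseconds _stale_limit;

 public:
  explicit SpinYieldSketch(std::chrono::milliseconds stale_limit)
      : _stale_limit(stale_limit) {}

  // Called once per spin round with a summary of this round's results
  // (e.g. the per-category counts packed into one integer).
  bool should_yield(uint64_t round_results) {
    if (round_results != _prev) {
      _prev = round_results;
      _last_change = clock::now();  // progress seen: keep spinning
      return false;
    }
    // No change: yield only once the results have gone stale.
    return clock::now() - _last_change > _stale_limit;
  }
};
```

A caller would do something like
`if (sy.should_yield(summary)) std::this_thread::yield();` in its spin
loop, so the thread gives up the CPU only in the no-progress phase.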

Translating the heuristic to a spinlock (the single-target-thread case)
protecting a very small operation: the spinlock has the owner as the
lock key. Using a normal spin-yield, you will end up in situations where
the owner just changed from A to B (as seen from the current thread),
but according to your spin-yield heuristic it is now time to sleep for
one millisecond. That is a very bad decision, since B is on proc, very
unlikely to get off proc inside the small critical section, and thus
most likely to unlock at any moment! But when B has held the lock for a
prolonged time, that is a great time to yield!

With many targeted threads for handshakes, this is like having one
spinlock per thread. As long as we see owner changes for those
spinlocks we have not yet acquired, it is not a good time to yield.
But when the owners go stale for the remaining locks to acquire,
yielding is our best bet.

> 
> I'm not clear on the role of the _result_count 2D array and exactly what 
> state changes we are trying to track here. Do we need an explicit 
> add_result versus passing pr into process()?

Because you first add the results for all your threads and then
process them.

This is the interesting use-case:
      for (JavaThread *thr = jtiwh.next(); thr != NULL; thr = jtiwh.next()) {
        // A new thread on the ThreadsList will not have an operation,
        // hence it is skipped in handshake_try_process.
        HandshakeState::ProcessResult pr = thr->handshake_try_process(_op);
        if (pr == HandshakeState::_success) {
          handshake_executed_by_vm_thread++;
        }
        cbo.add_result(pr);
      }
      cbo.process();
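The add_result()/process() split can be sketched as below: add_result()
tallies this round's outcomes per result category, and process()
compares the tallies with the previous round to decide whether anything
changed. The enum values and the two-row array shape are illustrative
assumptions, not the actual HotSpot definitions.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical result categories, standing in for
// HandshakeState::ProcessResult.
enum ProcessResult { _success, _no_operation, _not_safe, _claimed,
                     NUM_RESULTS };

class ResultTracker {
  // Row 0 holds the previous round's counts, row 1 the current round's.
  int _counts[2][NUM_RESULTS] = {};

 public:
  // Tally one per-thread outcome for the current round.
  void add_result(ProcessResult pr) { _counts[1][pr]++; }

  // End the round: report whether the result distribution changed since
  // the last round, then roll the current counts over and reset them.
  bool process() {
    bool changed =
        std::memcmp(_counts[0], _counts[1], sizeof _counts[0]) != 0;
    std::memcpy(_counts[0], _counts[1], sizeof _counts[0]);
    std::memset(_counts[1], 0, sizeof _counts[1]);
    return changed;
  }
};
```

In this sketch, a "changed" round resets the staleness clock, while an
unchanged round lets it run toward the yield threshold.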


> 
> And why do we not check pr (or even _op->is_completed()) immediately so 
> that we don't make the calls into the HSY unnecessarily?

Because that is a micro-optimization: the array is on the thread stack,
very short, and should be unrolled. Avoiding an inc and four cmps cannot
be measured.

> 
> We already have a number of spin/backoff strategies implemented across 
> the VM so it would be good to do some consolidation in this area in the 
> future if possible. At a minimum we should be employing the same basic 
> spin-then-yield strategies.

The problem with our spin-then-yield strategies is that they can become
LIFOs, since the thread that tried first gets punished the most.
This behavior is very often not wanted.
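The LIFO effect can be seen in a toy model of per-thread exponential
backoff (my own illustration, not code from the VM): the thread that has
failed the most attempts, i.e. the one that arrived first, backs off the
longest before retrying, so later arrivals tend to get served first.

```cpp
#include <cassert>
#include <cstdint>

// Delay before the next retry after n failed acquisition attempts:
// 1, 2, 4, 8, ... time units (classic exponential backoff).
inline int64_t backoff_after(int failed_attempts) {
  return int64_t(1) << failed_attempts;
}
```

For example, a thread that arrived early and has failed six times now
waits 64 units before retrying, while a newcomer with one failure waits
only 2 units and will likely grab the lock first.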

> 
>> Issue:
>> https://bugs.openjdk.java.net/browse/JDK-8244340
>>
>> Code:
>> http://cr.openjdk.java.net/~rehn/8244340/v1/webrev/
>>
>> In the pathologically case we can easily see 10x improvement starting 
>> the VM.
>> I have found no performance regressions or performance gains in our
>> standard benchmark setups.
>> But I see much better average completion times for handshakes.
>> For example specJVM2008 serial with ZGC shows 4x faster time to
>> completion for handshakes in my setup, but little or no effect on score
>> (ops/m).
> 
> What platforms have you been able to test this on?

Standard benchmarks on Linux/Windows/Mac OSX show no differences in score.
(Note that these are not with ZGC, so handshakes do not really matter much.)
With Linux/ZGC everything seems better, but no direct impact on score.

Thanks, Robbin

> 
> Thanks,
> David
> -----
> 
>> Passes t1-3.
>>
>> Thanks, Robbin


More information about the hotspot-runtime-dev mailing list