Improving the scalability of the evac OOM protocol

Nick Gasson nick.gasson at arm.com
Tue Oct 4 12:56:12 UTC 2022


Hi,

I've been running SPECjbb with Shenandoah on some large multi-socket Arm
systems and I noticed the concurrent evacuation OOM protocol is a bit of
a bottleneck.  The problem here is that we have a single variable,
_threads_in_evac, shared between all threads.  To enter the protocol we
do a CAS to increment the counter, and to leave we do an atomic
decrement.  For the GC threads this isn't really an issue as they only
enter/leave once per cycle, but Java threads have to enter/leave every
time they help evacuate an object on the load barrier slow path.  This
means _threads_in_evac is very heavily contended and we effectively
serialise Java thread execution through access to this variable: I
counted several million CAS failures per second in
ShenandoahEvacOOMHandler::register_thread() on one Arm N1 system while
running SPECjbb.  This is especially problematic on multi-socket systems
where the communication overhead of the cache coherency protocol can be
high.
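
Roughly, the protocol looks like this (a standalone model written
against std::atomic rather than HotSpot's Atomic:: API; the real code
is in ShenandoahEvacOOMHandler and the details here are simplified):

    #include <atomic>
    #include <cstdint>

    // Standalone model of the current single-counter protocol.
    static const uint32_t OOM_MARKER_MASK = 0x80000000u;
    static std::atomic<uint32_t> _threads_in_evac{0};

    // Enter the protocol: CAS-increment unless an OOM has been
    // signalled.  Every Java thread taking the load barrier slow path
    // hits this same cache line, which is where the CAS failures pile
    // up.
    bool register_thread() {
      uint32_t cur = _threads_in_evac.load(std::memory_order_acquire);
      while ((cur & OOM_MARKER_MASK) == 0) {
        if (_threads_in_evac.compare_exchange_weak(
                cur, cur + 1, std::memory_order_acq_rel)) {
          return true;   // entered evacuation
        }
        // CAS failure: cur has been reloaded, retry
      }
      return false;      // OOM in progress, caller takes the OOM path
    }

    // Leave the protocol with a plain atomic decrement.
    void unregister_thread() {
      _threads_in_evac.fetch_sub(1, std::memory_order_release);
    }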

I tried fixing this in a fairly simple way by replicating the counter N
times on separate cache lines (N=64, somewhat arbitrarily).  See the
draft patch below:

https://github.com/nick-arm/jdk/commit/ca78e77f0c6

Each thread hashes to a particular counter based on its Thread*.  To
signal an OOM we CAS in OOM_MARKER_MASK on every counter and then in
wait_for_no_evac_threads() we wait for every counter to go to zero (and
also to see OOM_MARKER_MASK set in that counter).  I think this is safe
and race-free based on the fact that, once OOM_MARKER_MASK is set, the
counter can only ever decrease.  So once we've seen a particular counter
go to zero we know that the value will never change except when clear()
is called at a safepoint.  This means we can just iterate over all the
counters, and if we see that they are all zero, then we know no more
threads are inside or can enter the evacuation path.
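
To make the idea concrete, here is a rough standalone sketch of the
striped scheme, again modelled with std::atomic; PaddedCounter,
counter_index() and signal_oom() are just illustrative names for this
sketch, the actual change is in the commit linked above:

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    static const uint32_t OOM_MARKER_MASK = 0x80000000u;
    static const size_t   NUM_COUNTERS    = 64;  // N=64 as in the patch

    // One counter per cache line so stripes don't false-share.
    struct alignas(64) PaddedCounter {
      std::atomic<uint32_t> value{0};
    };
    static PaddedCounter _threads_in_evac[NUM_COUNTERS];

    // Pick a stripe from the Thread* (any stable per-thread pointer
    // works for this model).
    static size_t counter_index(const void* thread) {
      return (reinterpret_cast<uintptr_t>(thread) >> 4) % NUM_COUNTERS;
    }

    // Enter/leave only touch the thread's own stripe.
    bool register_thread(const void* thread) {
      std::atomic<uint32_t>& c =
          _threads_in_evac[counter_index(thread)].value;
      uint32_t cur = c.load(std::memory_order_acquire);
      while ((cur & OOM_MARKER_MASK) == 0) {
        if (c.compare_exchange_weak(cur, cur + 1,
                                    std::memory_order_acq_rel)) {
          return true;
        }
      }
      return false;
    }

    void unregister_thread(const void* thread) {
      _threads_in_evac[counter_index(thread)].value
          .fetch_sub(1, std::memory_order_release);
    }

    // Signal the OOM by setting the marker bit on every stripe ...
    void signal_oom() {
      for (size_t i = 0; i < NUM_COUNTERS; i++) {
        std::atomic<uint32_t>& c = _threads_in_evac[i].value;
        uint32_t cur = c.load(std::memory_order_acquire);
        while (!c.compare_exchange_weak(cur, cur | OOM_MARKER_MASK,
                                        std::memory_order_acq_rel)) {
          // retry until the marker bit sticks on this stripe
        }
      }
    }

    // ... then wait for each stripe to read exactly OOM_MARKER_MASK
    // (marker set, zero threads).  Once the marker is set a stripe can
    // only decrease, so a zero observation is final until clear() at
    // the next safepoint.
    void wait_for_no_evac_threads() {
      for (size_t i = 0; i < NUM_COUNTERS; i++) {
        while (_threads_in_evac[i].value.load(std::memory_order_acquire)
               != OOM_MARKER_MASK) {
          // spin; the real code would pause/yield here
        }
      }
    }

The common enter/leave path is unchanged apart from selecting a stripe,
so only the OOM-signalling path has to touch all N counters.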

On a 160-core dual-socket Arm N1 system this improves SPECjbb max-jOPS
by ~8% and critical-jOPS by ~98% (!), averaged over 10 runs.  On a
32-core dual-socket Xeon system I get +0.4% max-jOPS and +43%
critical-jOPS.  There's also some benefit on single-socket systems: with
AWS c7g.16xlarge I see +0.3% max-jOPS and +3% critical-jOPS.

I've also tested SPECjbb on a fastdebug build with
-XX:+ShenandoahOOMDuringEvacALot and didn't see any errors.

I experimented with taking this to its logical conclusion and giving
each thread its own counter in ShenandoahThreadLocalData, but it's
difficult to avoid races with thread creation and this simple approach
seems to give most of the benefit anyway.

Any thoughts on this?

--
Thanks,
Nick

