Fixing the OOM-during-evac
Roman Kennke
rkennke at redhat.com
Wed Feb 28 10:42:10 UTC 2018
This issue keeps haunting me. :-)
Over coffee, I had an idea for how to solve it. Let me outline it and
open it up for discussion.
The issue is that when a Java thread hits OOM while in the
write-barrier, another thread (Java or GC) may still succeed in
evacuating the object. This is racy, because thread#1 may get a
from-space copy and write to it, while other threads may get a
to-space copy and write to that.
We need to prevent any other thread from evacuating our failed-to-evac
object, or else safely get the other copy.
My idea is to have a counter for the number of threads in the
evacuation path, and as soon as we hit OOM there, wait until the
counter drops to zero, at which point we can be sure that the object
cannot be evacuated under our feet.
We need to protect the evacuation path with the following protocol.
'The evacuation path' is the code around actual evacuation, i.e.
inside the evac-in-progress- and cset-checks, but around the actual
evac. This needs to be done in both the fast and the slow path.
There is a global counter that shows the number of threads inside the
evac-path, OR a special value (e.g. something negative) to indicate
OOM-during-evac (i.e. no threads are allowed to enter the path).
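One possible encoding of such a counter, as a rough sketch only. The mail asks for "something negative"; the sketch assumes the OOM state is a flag in the top bit of a 32-bit word (which makes the word negative when read as a signed int), so that the low 31 bits can keep counting the threads that are still inside the path. The flag-bit choice and all names below are assumptions, not part of the proposal.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical encoding: the top bit flags OOM-during-evac, the low 31 bits
// count the threads that are currently inside the evac-path.
constexpr std::uint32_t OOM_FLAG = 0x80000000u;

constexpr bool oom_flag_set(std::uint32_t word) {
  return (word & OOM_FLAG) != 0;
}

constexpr std::uint32_t threads_in_path(std::uint32_t word) {
  return word & ~OOM_FLAG;
}

constexpr std::uint32_t with_oom_flag(std::uint32_t word) {
  return word | OOM_FLAG;
}

// Removes one thread from the count and leaves the flag bit as it is.
inline std::uint32_t without_one_thread(std::uint32_t word) {
  assert(threads_in_path(word) > 0);
  return word - 1;
}
```

With this layout, "the counter drops to zero" means that the low 31 bits are zero, whether or not the flag is set.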
Upon entry to the evac-path, every thread attempts to increment the
counter using a CAS. Depending on the result of the CAS:
- success: carry on with evac
- failure:
  - if the offending value is a valid counter, then try again
  - if the offending value is the OOM-during-evac special value: loop
    until the counter drops to 0, then exit with read-barrier
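The entry steps above could look roughly like this with std::atomic. This is a sketch under assumptions: the OOM state is taken to be a flag in the top bit of the word (so the low bits still hold the thread count), and the function name, the yield in the wait loop and the memory orderings are illustrative choices. The mail does not say when the OOM state is reset, so the sketch leaves that out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Assumed encoding: top bit = OOM-during-evac, low 31 bits = thread count.
constexpr std::uint32_t OOM_FLAG = 0x80000000u;

// Returns true if the caller is now counted and may carry on with the evac.
// Returns false if OOM-during-evac was flagged; the function then waits
// until no thread is left in the path, and the caller should resolve the
// object through the read-barrier instead of evacuating.
inline bool enter_evac_path(std::atomic<std::uint32_t>& counter) {
  std::uint32_t seen = counter.load(std::memory_order_acquire);
  for (;;) {
    if ((seen & OOM_FLAG) != 0) {
      // Offending value is the OOM marker: loop until the count is 0.
      while ((counter.load(std::memory_order_acquire) & ~OOM_FLAG) != 0) {
        std::this_thread::yield();
      }
      return false;
    }
    assert(((seen + 1) & OOM_FLAG) == 0 && "count must not reach the flag bit");
    // On failure, compare_exchange_weak stores the offending value in
    // 'seen', so the next iteration retries against that value.
    if (counter.compare_exchange_weak(seen, seen + 1,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
      return true;
    }
  }
}
```

A caller would evacuate only when the function returns true, and pair each successful entry with exactly one decrement on exit.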
Upon exit, every thread decrements the counter using an atomic dec.
Upon OOM-during-evac, the thread attempts to CAS the OOM-during-evac
special value into the counter. Depending on the result:
- success: busy-loop until the counter drops to zero, then exit with
  RB (read-barrier)
- failure:
  - offender is a valid counter update: try again
  - offender is OOM-during-evac: busy-loop until the counter drops to
    zero, then exit with RB
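Putting entry, exit and the OOM case together, a minimal sketch might look as follows. It rests on two assumptions that the mail does not spell out: the OOM state is a flag in the top bit, so the thread count survives next to it, and the thread that hits OOM gives up its own slot in the same CAS that sets the flag, since otherwise the count could never reach zero. With that, the "offender is OOM-during-evac" case needs no separate branch: the CAS then only removes the caller's own count. The class and method names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

class EvacOomCounter {
  // Assumed encoding: top bit = OOM-during-evac, low 31 bits = thread count.
  static constexpr std::uint32_t OOM_FLAG = 0x80000000u;
  std::atomic<std::uint32_t> word_{0};

  void wait_until_path_is_empty() const {
    while ((word_.load(std::memory_order_acquire) & ~OOM_FLAG) != 0) {
      std::this_thread::yield();
    }
  }

public:
  // Entry: true means "counted, carry on with evac"; false means OOM was
  // flagged and the path is empty, so the caller falls back to the RB.
  bool enter() {
    std::uint32_t seen = word_.load(std::memory_order_acquire);
    for (;;) {
      if ((seen & OOM_FLAG) != 0) {
        wait_until_path_is_empty();
        return false;
      }
      if (word_.compare_exchange_weak(seen, seen + 1,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        return true;
      }
    }
  }

  // Exit: plain atomic decrement; the flag bit is not touched.
  void leave() {
    const std::uint32_t prev = word_.fetch_sub(1, std::memory_order_acq_rel);
    assert((prev & ~OOM_FLAG) > 0 && "leave() without matching enter()");
    (void)prev;
  }

  // OOM-during-evac, called by a thread that has entered and not yet left.
  // Sets the flag, removes the caller's own count, then waits for the rest.
  void handle_oom() {
    std::uint32_t seen = word_.load(std::memory_order_acquire);
    for (;;) {
      assert((seen & ~OOM_FLAG) > 0 && "caller must be inside the evac-path");
      const std::uint32_t desired = (seen - 1) | OOM_FLAG;
      if (word_.compare_exchange_weak(seen, desired,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        break;  // on failure 'seen' holds the offender and we try again
      }
    }
    wait_until_path_is_empty();
  }

  bool oom_flagged() const {
    return (word_.load(std::memory_order_acquire) & OOM_FLAG) != 0;
  }

  std::uint32_t threads_in_path() const {
    return word_.load(std::memory_order_acquire) & ~OOM_FLAG;
  }
};
```

A thread that returns from handle_oom() must not call leave() as well, because its count is already gone.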
For Java threads, this protocol needs to be done in the fast
(assembly) path too, because they can cause evacs. Or else, we could
decide to disable the fast-path altogether (I was never really sure if
the extra assembly did us much good).
GC threads don't have to protect every single evacuation, but can
instead do the protocol wholesale: i.e. enter on worker start, and
exit on worker done.
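The wholesale variant for GC workers maps naturally onto a scope object, sketched below. To keep the focus on the scoping, the stand-in counter uses a plain atomic increment and ignores the OOM marker; a real version would go through the CAS-based entry described above. PathCounter, EvacPathScope and run_gc_worker are hypothetical names, and the loop body is only a placeholder for the actual evacuation work.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified stand-in for the shared counter (no OOM marker handling).
struct PathCounter {
  std::atomic<std::uint32_t> threads{0};

  void enter() { threads.fetch_add(1, std::memory_order_acq_rel); }

  void leave() {
    const std::uint32_t prev = threads.fetch_sub(1, std::memory_order_acq_rel);
    assert(prev > 0 && "leave() without matching enter()");
    (void)prev;
  }
};

// Enters the evac-path on construction and leaves it on destruction.
class EvacPathScope {
  PathCounter& counter_;

public:
  explicit EvacPathScope(PathCounter& counter) : counter_(counter) {
    counter_.enter();
  }
  ~EvacPathScope() { counter_.leave(); }

  EvacPathScope(const EvacPathScope&) = delete;
  EvacPathScope& operator=(const EvacPathScope&) = delete;
};

// Hypothetical worker body: one enter on worker start, one exit on worker
// done, with no per-object protocol in between.
inline int run_gc_worker(PathCounter& counter, int objects_to_evacuate) {
  EvacPathScope scope(counter);
  int evacuated = 0;
  for (int i = 0; i < objects_to_evacuate; ++i) {
    ++evacuated;  // placeholder for evacuating one object
  }
  return evacuated;  // 'scope' is destroyed here and leaves the path
}
```

The trade-off is that a worker holds its slot for its whole run, so a Java thread that hits OOM may wait until the workers are done.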
Please think hard about this possible solution and try to poke holes
in it. Meanwhile, I'll come up with a prototype.
Cheers, Roman