RFR: 8205924: ZGC: Premature OOME due to failure to expand backing file
Per Liden
per.liden at oracle.com
Mon Jul 2 15:05:26 UTC 2018
ZGC currently assumes that there will be enough space available on the
backing file system to hold the max heap size (-Xmx). However, this
might not be true. For example, the backing filesystem might have been
misconfigured or space on that filesystem might be used by some other
process. In this situation, ZGC will try (and fail) to map more memory
every time a new page needs to be allocated (assuming that request can't
be satisfied by the page cache). As a result, we fail to flush the page
cache, which in turn means we throw a premature OOME, and we continuously
take the performance hit of making unnecessary fallocate() syscalls that
will never succeed. We should instead detect this situation, flush the
page cache and avoid making further fallocate() calls.
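To make the idea concrete, here is a minimal sketch (not the actual ZGC
code; the BackingFile class and its members are made-up names for
illustration) of the kind of logic involved: once fallocate() has failed
with ENOSPC, remember that fact and stop retrying, so the caller can fall
back to flushing the page cache instead of repeatedly making syscalls
that will never succeed:

  // Minimal sketch only, not the real ZGC implementation.
  // fallocate() needs _GNU_SOURCE; g++ defines it by default,
  // otherwise compile with -D_GNU_SOURCE.
  #include <fcntl.h>
  #include <errno.h>
  #include <stddef.h>

  class BackingFile {
  private:
    int    _fd;
    size_t _committed;     // Bytes successfully fallocate()'d so far
    bool   _expand_failed; // Set once an expansion has hit ENOSPC

  public:
    explicit BackingFile(int fd) :
        _fd(fd), _committed(0), _expand_failed(false) {}

    // Try to grow the backing file by 'size' bytes. Returns false if
    // the filesystem is out of space, in which case the caller should
    // try to satisfy the allocation from the page cache instead.
    bool expand(size_t size) {
      if (_expand_failed) {
        // A previous attempt already hit ENOSPC; don't keep issuing
        // syscalls we know will fail.
        return false;
      }
      if (fallocate(_fd, 0 /* mode */, (off_t)_committed, (off_t)size) != 0) {
        if (errno == ENOSPC) {
          _expand_failed = true;
        }
        return false;
      }
      _committed += size;
      return true;
    }
  };

The point is simply to turn "fail on every allocation" into "fail once,
then take the cheap path".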
This issue has been seen now and then in various tests (e.g. RunThese30M
and Kitchensink), typically on machines running older kernels without
support for memfd_create(), where we fall back to using /dev/shm, which
sometimes doesn't have enough space to hold the given max heap size
(default tmpfs size is 50% of the RAM in the machine).
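For context, here is a rough sketch (again not the real implementation;
the function name and error handling are purely illustrative) of the
fallback path being described: prefer memfd_create(), and on an older
kernel fall back to an unlinked file on /dev/shm, which is where the
tmpfs size limit comes into play:

  // Rough sketch only, not the real ZGC backing file code.
  #include <sys/syscall.h>
  #include <sys/vfs.h>
  #include <unistd.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <errno.h>

  static int create_backing_fd() {
    // memfd_create() was added in Linux 3.17; use the raw syscall so
    // this also builds against older glibc headers.
    int fd = (int)syscall(SYS_memfd_create, "java_heap", 0);
    if (fd != -1) {
      return fd;
    }
    if (errno != ENOSYS) {
      return -1;
    }

    // Older kernel: fall back to an unlinked file on /dev/shm. tmpfs
    // is sized to 50% of RAM by default, so it may be smaller than
    // the requested max heap size.
    char path[] = "/dev/shm/java_heap.XXXXXX";
    fd = mkstemp(path);
    if (fd == -1) {
      return -1;
    }
    unlink(path); // Keep the fd, drop the name

    // Report how much space the backing filesystem actually has; if
    // this is less than -Xmx, expansion will eventually fail and we
    // end up in the situation described above.
    struct statfs buf;
    if (fstatfs(fd, &buf) == 0) {
      unsigned long long avail =
          (unsigned long long)buf.f_bavail * (unsigned long long)buf.f_bsize;
      fprintf(stderr, "Backing filesystem: %llu bytes available\n", avail);
    }
    return fd;
  }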
Bug: https://bugs.openjdk.java.net/browse/JDK-8205924
Webrev: http://cr.openjdk.java.net/~pliden/8205924/webrev.0
Testing: Passed two iterations of tier{1,2,3,4,5,6} on linux-x64, passed
multiple iterations of RunThese30M locally, and did various manual
testing to provoke the bad situation.
/Per