RFR: 8205924: ZGC: Premature OOME due to failure to expand backing file
Per Liden
per.liden at oracle.com
Mon Jul 2 15:05:26 UTC 2018
ZGC currently assumes that there will be enough space available on the
backing file system to hold the max heap size (-Xmx). However, this
might not be true. For example, the backing filesystem might have been
misconfigured or space on that filesystem might be used by some other
process. In this situation, ZGC will try (and fail) to map more memory
every time a new page needs to be allocated (assuming that request can't
be satisfied by the page cache). As a result, we fail to flush the page
cache, which in turn means we throw a premature OOME, and we continuously
take the performance hit of making unnecessary fallocate() syscalls that
will never succeed. We should instead detect this situation, flush the
page cache and avoid making further fallocate() calls.
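To make the idea concrete, here is a minimal sketch (not the actual ZGC
code; the BackingFile class and its members are made-up names for
illustration) of the kind of logic involved: once fallocate() has failed
with ENOSPC, remember that fact and stop retrying, so the caller can fall
back to flushing the page cache instead of repeatedly making syscalls
that will never succeed:

  // Minimal sketch only, not the real ZGC implementation.
  // fallocate() needs _GNU_SOURCE; g++ defines it by default,
  // otherwise compile with -D_GNU_SOURCE.
  #include <fcntl.h>
  #include <errno.h>
  #include <stddef.h>

  class BackingFile {
  private:
    int    _fd;
    size_t _committed;     // Bytes successfully fallocate()'d so far
    bool   _expand_failed; // Set once an expansion has hit ENOSPC

  public:
    explicit BackingFile(int fd) :
        _fd(fd), _committed(0), _expand_failed(false) {}

    // Try to grow the backing file by 'size' bytes. Returns false if
    // the filesystem is out of space, in which case the caller should
    // try to satisfy the allocation from the page cache instead.
    bool expand(size_t size) {
      if (_expand_failed) {
        // A previous attempt already hit ENOSPC; don't keep issuing
        // syscalls we know will fail.
        return false;
      }
      if (fallocate(_fd, 0 /* mode */, (off_t)_committed, (off_t)size) != 0) {
        if (errno == ENOSPC) {
          _expand_failed = true;
        }
        return false;
      }
      _committed += size;
      return true;
    }
  };

The point is simply to turn "fail on every allocation" into "fail once,
then take the cheap path".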
This issue has been seen now and then in various tests (e.g. RunThese30M
and Kitchensink), typically on machines running older kernels without
support for memfd_create(), where we fall back to using /dev/shm, which
sometimes doesn't have enough space to hold the given max heap size
(default tmpfs size is 50% of the RAM in the machine).
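For context, here is a rough sketch (again not the real implementation;
the function name and error handling are purely illustrative) of the
fallback path being described: prefer memfd_create(), and on an older
kernel fall back to an unlinked file on /dev/shm, which is where the
tmpfs size limit comes into play:

  // Rough sketch only, not the real ZGC backing file code.
  #include <sys/syscall.h>
  #include <sys/vfs.h>
  #include <unistd.h>
  #include <stdlib.h>
  #include <stdio.h>
  #include <errno.h>

  static int create_backing_fd() {
    // memfd_create() was added in Linux 3.17; use the raw syscall so
    // this also builds against older glibc headers.
    int fd = (int)syscall(SYS_memfd_create, "java_heap", 0);
    if (fd != -1) {
      return fd;
    }
    if (errno != ENOSYS) {
      return -1;
    }

    // Older kernel: fall back to an unlinked file on /dev/shm. tmpfs
    // is sized to 50% of RAM by default, so it may be smaller than
    // the requested max heap size.
    char path[] = "/dev/shm/java_heap.XXXXXX";
    fd = mkstemp(path);
    if (fd == -1) {
      return -1;
    }
    unlink(path); // Keep the fd, drop the name

    // Report how much space the backing filesystem actually has; if
    // this is less than -Xmx, expansion will eventually fail and we
    // end up in the situation described above.
    struct statfs buf;
    if (fstatfs(fd, &buf) == 0) {
      unsigned long long avail =
          (unsigned long long)buf.f_bavail * (unsigned long long)buf.f_bsize;
      fprintf(stderr, "Backing filesystem: %llu bytes available\n", avail);
    }
    return fd;
  }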
Bug: https://bugs.openjdk.java.net/browse/JDK-8205924
Webrev: http://cr.openjdk.java.net/~pliden/8205924/webrev.0
Testing: Passed two iterations of tier{1,2,3,4,5,6} on linux-x64, passed
multiple iterations of RunThese30M locally, and did various manual
testing to provoke the bad situation.
/Per