OOM error caused by large array allocation in G1

Sat Nov 18 15:37:04 UTC 2017

Hi,

On Sat, 2017-11-18 at 20:53 +0800, Lijie Xu wrote:
> Hi All,
> I recently encountered an OOM error in a Spark application using G1
> collector. This application launches multiple JVM instances to
> process the large data. Each JVM has 6.5GB heap size and uses G1
> collector. A JVM instance throws an OOM error during allocating a
> large (570MB) array. However, this JVM has about 3GB free heap space
> at that time. After analyzing the application logic, heap usage, and
> GC log, I guess the root cause may be the lack of consecutive space
> for holding this large array in G1. I want to know whether my guess
> is right ...

Very likely. This is a long-standing issue (I actually investigated it
about 10 years ago on a different regional collector), and given your
findings it is very likely you are correct. The issue also has its own
section in the tuning guide [0].
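
To make the fragmentation point concrete, here is a toy sketch in Java.
It is not G1's allocator; the class name, the region count and the
"every other region occupied" layout are assumptions chosen only for
illustration. It shows how a regional heap can have plenty of free
regions in total and still lack a long enough run of adjacent free
regions for one large array.

```java
import java.util.BitSet;

/** Toy model of a region-based heap. Hypothetical; not G1's real allocator. */
final class RegionHeapSketch {
    private final BitSet used;      // bit i set = region i is occupied
    private final int regionCount;

    RegionHeapSketch(int regionCount) {
        this.regionCount = regionCount;
        this.used = new BitSet(regionCount);
    }

    void occupy(int region) {
        used.set(region);
    }

    int freeRegions() {
        return regionCount - used.cardinality();
    }

    /** Length of the longest run of adjacent free regions. */
    int longestFreeRun() {
        int best = 0;
        int i = used.nextClearBit(0);
        while (i < regionCount) {
            int end = used.nextSetBit(i);   // first occupied region after i, or -1
            if (end < 0) {
                end = regionCount;
            }
            best = Math.max(best, end - i);
            i = used.nextClearBit(end);
        }
        return best;
    }

    /** A large array needs this many adjacent free regions. */
    boolean canAllocateContiguous(int regionsNeeded) {
        return longestFreeRun() >= regionsNeeded;
    }

    public static void main(String[] args) {
        // Assumed layout: 3250 regions of 2 MB (about 6.5 GB), every other one occupied.
        RegionHeapSketch heap = new RegionHeapSketch(3250);
        for (int r = 0; r < 3250; r += 2) {
            heap.occupy(r);
        }
        int needed = 285; // a 570 MB array at 2 MB per region
        System.out.println("free regions:     " + heap.freeRegions());
        System.out.println("longest free run: " + heap.longestFreeRun());
        System.out.println("fits 570 MB:      " + heap.canAllocateContiguous(needed));
    }
}
```

In this made-up layout about half the heap is free, yet the longest
free run is a single region, so the allocation fails. A real heap will
not look this regular, but the failure mode is the same.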

> ... and why G1 has this defect.

Nobody fixed it yet. :)

Reasons:
- the workaround is easy and typically "just works".
- there are no "real world" test setups available where fixes could be
tested. People tend to disappear after getting to know the workaround.
Unfortunately, Apache Spark, which is probably one of the more frequent
environments where it happens, still does not work on jdk9/10 (and soon
11), which is where development happens.
- it's not very interesting work for many. Not sure why, probably
because it involves implementing and evaluating longer-term strategies
in the collector to minimize the impact of fragmentation, which is a
complex topic (at least if you are not satisfied with the last-ditch
brute force approach).
- there are more problematic issues to deal with that affect more
installations, have test setups, and have no workaround or no good one.

Actually, I was discussing this with colleagues again just last week in
the context of work for students/interns. :)

If you want to look into this there are a bunch of CRs open that you
might want to start with (e.g. [1][2][3]) to get an idea of
possibilities - these CRs do not even mention the one brute force
solution other VMs probably apply in that situation: have the full gc
move large arrays too.

Feel free to start a discussion about this topic either here or
preferably in the hotspot-gc-dev mailing list.

> In the following sections, I will detail the JVM info, application,
> OOM phase, and heap usage. Any suggestions will be appreciated.

Simply either increase the heap size or increase the region size via
-XX:G1HeapRegionSize. I think 16m regions will fix the issue in your
case without any other performance impact, and significantly reduce the
number of humongous objects.
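
As a rough illustration of why a larger region size (the
-XX:G1HeapRegionSize setting) helps, the Java sketch below computes how
many adjacent regions a 570MB array needs at different region sizes,
and where the humongous threshold sits. The class and method names are
made up for this example. The "half a region or larger" rule follows the
tuning guide [0]; the exact boundary case may differ in the
implementation. The 2m entry is an assumed default for a 6.5GB heap,
not a value taken from the GC log.

```java
/** Back-of-the-envelope region arithmetic. Hypothetical helper, not a JDK API. */
final class HumongousMath {
    static final long MB = 1024L * 1024L;

    /** Per the tuning guide: objects of half a region or larger are humongous. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes >= regionBytes / 2;
    }

    /** Number of adjacent regions needed to hold one object (rounded up). */
    static long regionsNeeded(long objectBytes, long regionBytes) {
        return (objectBytes + regionBytes - 1) / regionBytes;
    }

    public static void main(String[] args) {
        long arrayBytes = 570 * MB; // array size from the report
        // G1 region sizes are powers of two; 2m is an assumed starting point.
        for (long regionMb : new long[] {2, 4, 8, 16, 32}) {
            long regionBytes = regionMb * MB;
            System.out.printf(
                "region=%2dm humongous-threshold=%2dm regions-needed=%d%n",
                regionMb, regionBytes / 2 / MB, regionsNeeded(arrayBytes, regionBytes));
        }
    }
}
```

Going from 2m to 16m regions drops the requirement from 285 adjacent
free regions to 36, and raises the humongous threshold from 1m to 8m,
so far fewer allocations are treated as humongous in the first place.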

> [JVM info]
> java version "1.8.0_121"
> Oracle Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

While it won't impact this issue, I recommend updating at least to the
latest 8u release. Not suggesting jdk 9 here because we know that SPARK
does not work there yet.

Thanks,
  Thomas

[0] https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#GUID-2428DA90-B93D-48E6-B336-A849ADF1C552
[1] https://bugs.openjdk.java.net/browse/JDK-8172713
[2] https://bugs.openjdk.java.net/browse/JDK-8038487
[3] https://bugs.openjdk.java.net/browse/JDK-8173627