RFR: 8292083: Detected container memory limit may exceed physical machine memory [v19]

Thu Aug 25 17:31:32 UTC 2022

On Wed, 24 Aug 2022 10:35:18 GMT, Jonathan Dowland <jdowland at openjdk.org> wrote:

>> We discovered some systems configured with cgroups v1 which report a bogus container memory limit value which is above the physical memory of the host. OpenJDK then calculates flags such as InitialHeapSize based on this invalid value; this can be larger than the available memory which can result in the OS terminating the process due to OOM.
>> 
>> hotspot's container awareness attempts to sanity check the limit value by ensuring it's below `_unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size()`, but that still leaves a large range of potential invalid values between physical RAM and that ceiling value.
>> 
>> Cgroups V1 in particular returns an uninitialised value for the memory limit when one has not been explicitly set. Cgroups v2 does not suffer the same problem: however, it's possible for any value to be set for the max memory, including values exceeding the available physical memory, in either v1 or v2.
>> 
>> This fixes the problem in two places. Further work may be required in the area of Java metrics / MXBeans. I'd also look again at whether the existing ceiling value `_unlimited_memory` serves any useful purpose. I personally don't feel those improvements should hold up this fix.
>
> Jonathan Dowland has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Address style nit

So, before we had:

os::physical_memory()  
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    os::Linux::physical_memory()

Made sense if one thinks of the `os` layer as the arbiter that has to lie to the caller to abstract away platform stuff, and the `Linux` layer as the real source of truth.

Similar for available memory:

os::available_memory()
        |
        v
os::Linux::available_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    or returns host available memory

Already a bit crooked, since here the splicing is done at the `::Linux` layer.

With this patch we have:

os::physical_memory()  
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    os::Linux::physical_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
os::Linux::physical_memory()            or returns cgroup limit

and

os::available_memory()
        |
        v
os::Linux::available_memory()
        |
        |-----------------------------------------
        |                                        |
        v                                        v
OSContainer::memory_limit_in_bytes()    or returns host available memory
        |
        |-----------------------------------------
        |                                        |
        v                                        v
os::Linux::physical_memory()            or returns cgroup limit

This is getting a bit hard to understand. The only one behaving as advertised is `os::Linux::physical_memory()`, everyone else is somehow splicing other information in. Of course a lot of this is preexisting and has nothing to do with this patch. But it deepens the confusion.
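
To make the chain for `os::physical_memory()` concrete, here is a rough, self-contained model of the post-patch behaviour as drawn above. It is not the actual HotSpot code: `MemoryModel`, its fields, and the convention that -1 means "no usable limit" are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the layers in the diagrams; not HotSpot code.
struct MemoryModel {
  uint64_t host_physical;     // what os::Linux::physical_memory() would report
  int64_t  raw_cgroup_limit;  // raw value from the cgroup files; <= 0 means "none"

  // Models OSContainer::memory_limit_in_bytes() with the patch applied:
  // a limit at or above host RAM is treated as "no usable limit" (-1, assumed).
  int64_t container_limit() const {
    if (raw_cgroup_limit <= 0) return -1;
    if (static_cast<uint64_t>(raw_cgroup_limit) >= host_physical) return -1;
    return raw_cgroup_limit;
  }

  // Models os::physical_memory(): the container limit if usable, else host RAM.
  uint64_t physical_memory() const {
    assert(host_physical > 0);
    const int64_t limit = container_limit();
    return limit > 0 ? static_cast<uint64_t>(limit) : host_physical;
  }
};
```

Note that in this model every call to `physical_memory()` re-evaluates the cgroup limit against host memory, which is the per-invocation cost questioned below.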

One thing I don't understand is, why does this calculation have to be done at every invocation of os::available_memory()/os::physical_memory()/OSContainer::memory_limit_in_bytes() etc? Yes, cgroup limits can change, but AFAIK there is nothing in the VM that can react to these changes anyway. Heap geometry etc. gets sorted out at VM start.

So why do we not just do this:

  // We need to update the amount of physical memory now that
  // cgroup subsystem files have been processed.
-  if ((mem_limit = cgroup_subsystem->memory_limit_in_bytes()) > 0) {
+  size_t mem_limit = cgroup_subsystem->memory_limit_in_bytes();
+  if (mem_limit > 0 && mem_limit < os::Linux::physical_memory()) {
    os::Linux::set_physical_memory(mem_limit);
    log_info(os, container)("Memory Limit is: " JLONG_FORMAT, mem_limit);
  }

and be done with it?
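
The same idea as a standalone sketch, for readers without the HotSpot sources at hand: resolve the effective physical memory once at startup and serve the cached value afterwards. `EffectiveMemory` and its methods are hypothetical names, and the sketch assumes a non-positive limit means "not set".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical one-time resolution of "effective physical memory".
class EffectiveMemory {
 public:
  explicit EffectiveMemory(uint64_t host_physical) : _physical(host_physical) {
    assert(host_physical > 0);
  }

  // Mirrors the check in the diff above: adopt the cgroup limit only if it
  // is positive and strictly below the current physical memory value.
  // Returns true if the limit was adopted.
  bool apply_container_limit(int64_t mem_limit) {
    if (mem_limit > 0 && static_cast<uint64_t>(mem_limit) < _physical) {
      _physical = static_cast<uint64_t>(mem_limit);
      return true;
    }
    return false;
  }

  // Later queries just read the cached value; no cgroup files are re-read.
  uint64_t physical_memory() const { return _physical; }

 private:
  uint64_t _physical;
};
```

The trade-off is the one already mentioned: a cgroup limit changed after startup would go unnoticed, which only matters if something in the VM could react to it.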

-------------

PR: https://git.openjdk.org/jdk/pull/9880