RFR: 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected [v2]

Severin Gehwolf sgehwolf at openjdk.org
Thu Sep 5 09:09:55 UTC 2024


On Wed, 4 Sep 2024 10:15:31 GMT, Severin Gehwolf <sgehwolf at openjdk.org> wrote:

>>> > I am not aware of actual scenarios like that, nor do I see much reason to do something like that, even though it's possible. :)
>>> 
>>> Having slept over this, I now feel we'd regress cgv1 if we don't do it. Given your suggested test, I'll play with covering this case and adding that to the testing framework in #19530. Then, based on the additional complexity, we can decide whether or not to cover it.
>> 
>> https://github.com/openjdk/jdk/pull/20646/commits/0e17c9914f187d9ee1af32674f03b06e8173bc18 implements this. PTAL.
>
>> @jerboaa IIUC this patch sets the memory subsystem path to correspond to the cgroup with the lowest memory limit in the hierarchy. One implication of this approach is that all the memory stats would now be retrieved from that cgroup. So now the swap limit will also be read from that cgroup even though the JVM's cgroup may have a lower value. Is this the desired behavior?
> 
> @ashu-mehra You mean some setup like this (assume cgroup v2)?
> 
> 1. `foo.slice.d/memory-limit.conf` has `MemoryMax=200M`
> 2. `foo-mem.slice` has `MemoryMax=400M` *and* `MemorySwapMax=0`
> 
> That is, we have a memory controller path to `/sys/fs/cgroup/foo.slice` since the lowest memory limit is `200M`, yet `/sys/fs/cgroup/foo.slice/memory.swap.max` would be `max` and `/sys/fs/cgroup/foo.slice/foo-mem.slice/memory.swap.max` would be `0`. I.e. the kernel doesn't allow swap, while the JVM thinks it has unlimited swap. Yes, that's possible and a good thought experiment. So thanks!
> 
> With that said, this doesn't seem to be a problem for the common k8s container case where we have a much simpler setup. There, the path to the memory limit and the swap limit is always the same (at least AFAIK). Also, as far as the JVM is concerned, it uses `os::physical_memory()` for sizing the heap (it's not using swap metrics for anything significant). `os::physical_memory()` in turn uses the cgroups code to figure out the memory limit and uses that over the host's physical memory. I.e. it would still be OK in such a case. The reporting of the swap limit would be incorrect, though.
> 
> Same for the CPU count code, which is used by various bits of the JVM (GC, JIT, etc.). An equivalent of the memory/swap pairing doesn't exist there: at least I don't see any other, non-critical, metric we expose using the adjusted cpu path.
> 
> IMO the important bits are the memory limit and the cpu count. The rest isn't that critical. So should we do something about it? Possibly. I'm happy to file a bug documenting this limitation. We could then address it if it turns out to become a real problem. Right now, it strikes me as unusual enough for it to be OK as it is. What do you think?
> 
> A possible fix would be to always walk the full hierarchy for any metric we look up, but I have the feeling this is You Ain't Gonna Need It (YAGNI) enough to not implement it right away. FWIW, I wasn't able to re-create such a config on cg v1 exposing this swap issue. Were you?

> @jerboaa I agree it would probably just result in incorrect reporting of available swap space. I guess only JFR is using that stat. I can't really comment on the importance of the swap value for JFR. If we can live with this issue I guess then this scenario can be ignored.

I've filed https://bugs.openjdk.org/browse/JDK-8339593 to summarize this. Right now, I don't think it's critical enough to warrant implementing it for all possible cases. See the bug for details. The goal of this fix is to prevent spurious OOM kills when the JDK runs in a systemd service with some resource control applied, and I think it covers that.
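
For anyone wanting to try the systemd-service scenario locally, something along these lines should reproduce it (a sketch, not tested here; the `200M` value and the grep pattern are just for illustration, and `--user` may need controller delegation on your distro):

```
# Run the JVM in a transient systemd scope with a memory limit applied
# (systemd maps MemoryMax= onto cgroup v2 memory.max).
$ systemd-run --user --scope -p MemoryMax=200M \
      java -Xlog:os+container=trace -version 2>&1 | grep -i 'memory limit'
```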

> > FWIW, I wasn't able to re-create such a config on cg v1 exposing this swap issue. Were you?
> 
> I have cg v2 on my system and I can produce this kind of configuration manually:
> 
>     1. A `test` cgroup with `memory.max` set to ~3g and `memory.swap.max` to `max`
> 
>     2. A child of `test` called `bar` with `memory.max` set to `max` and `memory.swap.max` set to `0`.
> 
> 
> ```
> $ pwd
> /sys/fs/cgroup/test
> $ cat memory.max
> 3145728000
> $ cat memory.swap.max
> max
> $ cd bar
> $ pwd
> /sys/fs/cgroup/test/bar
> $ cat memory.max
> max
> $ cat memory.swap.max
> 0
> ```
> 
> I don't know if this can be replicated with systemd. I haven't tried that.

It's possible, and I [did it on cg v2 as mentioned in my previous comment](https://github.com/openjdk/jdk/pull/20646#issuecomment-2328467306). In any case, the cost/benefit trade-off of covering all possible configs out there isn't worth it just yet.
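
For completeness, the quoted manual setup can be recreated roughly like this (a sketch; run as root on a cgroup v2 machine, and note that `memory.max` of a fresh child already defaults to `max`):

```
# Make sure the memory controller is enabled for children, then build
# the test/bar hierarchy with the limits from the session quoted above.
$ echo +memory > /sys/fs/cgroup/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/test
$ echo 3145728000 > /sys/fs/cgroup/test/memory.max
$ echo max > /sys/fs/cgroup/test/memory.swap.max
$ echo +memory > /sys/fs/cgroup/test/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/test/bar
$ echo 0 > /sys/fs/cgroup/test/bar/memory.swap.max
$ echo $$ > /sys/fs/cgroup/test/bar/cgroup.procs   # move the current shell into bar
```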

> I also wonder why don't we cache the memory limits (and other metrics like cpu count) instead of reading from the config file every time?

It's already there? This has been added with https://bugs.openjdk.org/browse/JDK-8227006 and https://bugs.openjdk.org/browse/JDK-8232207. The current code doing it for CPU is [here](https://github.com/openjdk/jdk/blob/96a0502d624e3eff1b00a7c63e8b3a27870b475e/src/hotspot/os/linux/cgroupSubsystem_linux.cpp#L562-L584) and for memory it's [here](https://github.com/openjdk/jdk/blob/master/src/hotspot/os/linux/cgroupSubsystem_linux.cpp#L595-L607). It's only been done for the core metrics, though.
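
If one wants to check how often the limit files actually get (re-)read, counting file opens is a crude but quick way to do it (a sketch; the exact count depends on the JDK version and which code paths run):

```
# Count how often the JVM opens the cgroup v2 memory limit file at startup.
$ strace -f -e trace=openat java -version 2>&1 | grep -c memory.max
```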

-------------

PR Comment: https://git.openjdk.org/jdk/pull/20646#issuecomment-2330998530

