RFR: 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected [v2]
Ashutosh Mehra
asmehra at openjdk.org
Wed Sep 4 18:58:24 UTC 2024
On Wed, 4 Sep 2024 10:15:31 GMT, Severin Gehwolf <sgehwolf at openjdk.org> wrote:
>>> > I am not aware of actual scenarios like that, nor I see much reason to do something like that, even though possible. :)
>>>
>>> Having slept over this, I now feel we'd regress cgv1 if we don't do it. Given your suggested test, I'll play with covering this case and adding that to the testing framework in #19530. Then based on the additional complexity we can decide whether or not to cover it or not.
>>
>> https://github.com/openjdk/jdk/pull/20646/commits/0e17c9914f187d9ee1af32674f03b06e8173bc18 implements this. PTAL.
>
>> @jerboaa IIUC this patch sets the memory subsystem path to correspond to the cgroup with the lowest memory limit in the hierarchy. One implication of this approach is that all the memory stats would now be retrieved from that cgroup. So now the swap limit will also be read from that cgroup even though the JVM's cgroup may have a lower value. Is this the desired behavior?
>
> @ashu-mehra You mean some setup like this (assume cgroup v2)?
>
> 1. `foo.slice.d/memor-limit.conf` has `MemoryMax=200M`
> 2. `foo-mem.slice` has `MemoryMax=400M` *and* `MemorySwapMax=0`
>
> That is, we have a memory controller path to `/sys/fs/cgroup/foo.slice` since the lowest memory limit is `200M`, yet `/sys/fs/cgroup/foo.slice/memory.swap.max` would be `max` and `/sys/fs/cgroup/foo.slice/foo-mem.slice/memory.swap.max` would be `0`. I.e. the kernel doesn't allow swap, while the JVM thinks it has unlimited swap. Yes, that's possible and a good thought experiment. So thanks!
>
> With that said, this doesn't seem to be a problem for the common k8s container case where we have a much simpler setup. There, the path to the memory limit and swap limit is always the same (at least AFAIK). Also, as far as the JVM is concerned it uses `os::physical_memory()` for sizing the heap (it's not using swap metrics for anything significant). `os::physical_memory()` in-turn would use the cgroups code to figure out the memory limit and use that over the host physical memory. I.e. it would still be OK in such a case. The reporting for the swap limit would be incorrect, though.
>
> Same for the CPU count code which is used by various bits by the JVM (GC, JIT, etc.). The symmetry there doesn't exist. At least I don't see any other - non-critical - metric we expose using the adjusted cpu path.
>
> IMO the important bits are the memory limit and the cpu count. The rest isn't that critical. So should we do something about it? Possibly. Happy to file a bug with this limitation documented. We could then address it if it turns out to become a real problem. Right, now, it strikes me as unusual enough for it to be OK as it is. What do you think?
>
> A possible fix would be to always walk the full hierarchy for any metric we look up, but I have the feeling this is You Ain't Gonna Need It (YAGNI) enough to not implement it right away. FWIW, I wasn't able to re-create such a config on cg v1 exposing this swap issue. Were you?
@jerboaa I agree it would probably just result in incorrect reporting of available swap space. I guess only JFR is using that stat. I can't really comment on the importance of the swap value for JFR. If we can live with this issue I guess then this scenario can be ignored.
> FWIW, I wasn't able to re-create such a config on cg v1 exposing this swap issue. Were you?
I have cg v2 on my system and I can produce this kind configuration manually:
1. A `test` cgroup with `memory.max` set to ~3g and `memory.swap.max` to `max`
2. A child of `test` called `bar` with `memory.max` set to `max` and and `memory.swap.max` set to 0.
$ pwd
/sys/fs/cgroup/test
$ cat memory.max
3145728000
$ cat memory.swap.max
max
$ cd bar
$ pwd
/sys/fs/cgroup/test/bar
$ cat memory.max
max
$ cat memory.swap.max
0
I don't know if this can be replicated with systemd. I haven't tried that.
I also wonder why don't we cache the memory limits (and other metrics like cpu count) instead of reading from the config file every time?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/20646#issuecomment-2329764806
More information about the hotspot-runtime-dev
mailing list