RFR: 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected
Severin Gehwolf
sgehwolf at openjdk.org
Tue Aug 20 14:54:18 UTC 2024
Please review this Linux container detection improvement, which allows limits to be detected even when they are not exposed at the leaf node of the cgroup hierarchy. So far this is only observable with systemd slices on cgroup v2. For cgroup v1 this was addressed with [JDK-8217338](https://bugs.openjdk.org/browse/JDK-8217338) in a version-specific way. This patch proposes to address the problem differently: instead of looking only at the determined cgroup path for the interface files, we walk the hierarchy up to its root and stop as soon as we find a limit at a given path, since it is best practice not to set a higher limit lower down the hierarchy (except for the default of unset/max).
Consider this subsystem path:
/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope
with a root of `/sys/fs/cgroup/memory` and a cgroup path of `/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope`. Prior to this patch we looked only at `/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.max` for the limit on cgroup v2 systems, and at `/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.limit_in_bytes` on cgroup v1 systems. On cgroup v1, if the look-up in the `memory.limit_in_bytes` file returned no limit, we additionally consulted `/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.stat`, looking for the `hierarchical_memory_limit` key there. However, `hierarchical_memory_limit` is cgroup v1 specific and not present in cgroup v2's `memory.stat` files.
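For context, here is a minimal sketch (not the actual HotSpot code) of what that cgroup v1 fallback from JDK-8217338 amounts to; the function name is made up, and the unlimited constant matches the `9223372036854771712` value visible in the logs below:

```c++
#include <cinttypes>
#include <cstdio>
#include <cstring>

// Typical cgroup v1 "no limit" default (9223372036854771712).
static const uint64_t UNLIMITED = 0x7FFFFFFFFFFFF000ULL;

// Scans <controller_dir>/memory.stat for the hierarchical_memory_limit key,
// returning UNLIMITED if the file is missing or the key is absent.
uint64_t v1_hierarchical_memory_limit(const char* controller_dir) {
  char stat_path[4096];
  snprintf(stat_path, sizeof(stat_path), "%s/memory.stat", controller_dir);
  FILE* f = fopen(stat_path, "r");
  if (f == nullptr) return UNLIMITED;
  char key[64];
  uint64_t value = UNLIMITED;
  uint64_t v = 0;
  // memory.stat consists of "key value" lines, so scan them pairwise.
  while (fscanf(f, "%63s %" SCNu64, key, &v) == 2) {
    if (strcmp(key, "hierarchical_memory_limit") == 0) {
      value = v;
      break;
    }
  }
  fclose(f);
  return value;
}
```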
This patch addresses the problem in a uniform way by walking the cgroup path up to the root and looking for a limit at each level, which solves both the problem that JDK-8217338 fixed in a version-specific way back then and the same problem on cgroup v2. As soon as any limit is found, that path is used for the specific controller. On cgroup v1 the following series of paths is examined in that order (provided no limit is set, so processing doesn't stop early); a sketch of the walk follows the list:
/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.limit_in_bytes
/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/memory.limit_in_bytes
/sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes
/sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
/sys/fs/cgroup/memory/memory.limit_in_bytes
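A minimal sketch of such a walk, assuming a simplified `limit_from_file` helper (all names here are illustrative, not the patch's actual code):

```c++
#include <cstdio>
#include <string>

static const long UNLIMITED = -1;

// Simplified reader: returns the numeric limit in <dir>/<limit_file>, or
// UNLIMITED when the file is missing or non-numeric (cgroup v2's "max").
// Real code must also treat cgroup v1's huge default value as unlimited.
static long limit_from_file(const std::string& dir, const char* limit_file) {
  std::string path = dir + "/" + limit_file;
  FILE* f = fopen(path.c_str(), "r");
  if (f == nullptr) return UNLIMITED;
  long value = UNLIMITED;
  if (fscanf(f, "%ld", &value) != 1) value = UNLIMITED;
  fclose(f);
  return value;
}

// Walks from the leaf towards the mount root and returns the first (lowest)
// path at which a limit is set, or the mount root when none is found.
std::string lowest_limit_path(const std::string& mount_root,
                              std::string cgroup_path,
                              const char* limit_file) {
  while (true) {
    std::string candidate = mount_root + cgroup_path;
    if (limit_from_file(candidate, limit_file) != UNLIMITED) {
      return candidate; // stop early: a limit is set at this level
    }
    size_t slash = cgroup_path.find_last_of('/');
    if (slash == std::string::npos || slash == 0) {
      return mount_root; // no limit anywhere below the root
    }
    cgroup_path.erase(slash); // step up one level, e.g. drop the .scope part
  }
}
```

Applied to the example hierarchy with `limit_file` set to `memory.limit_in_bytes`, this visits exactly the five paths above and, as the log below shows, would settle on `/sys/fs/cgroup/memory/user.slice/user-cg.slice`, where the 4 GiB limit (4294967296) is set.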
The patch implements this for the memory and cpu controllers only, since Hotspot currently uses just those for its internal data structures (memory for the heap, cpu for concurrency).
The hierarchy walk is performed only once, at cgroup subsystem initialization, on the assumption that the cgroup path remains constant. What's more, the walk isn't performed for containers (docker, podman, crio), since those set the limits at the leaf node. As a corollary, this fix doesn't change processing when the JDK runs in containers; it changes behavior for systemd-guided setups or for some hand-crafted hierarchies. A hypothetical sketch of that init-time wiring follows.
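This is only one plausible shape of the wiring, assuming an explicit `is_containerized()` guard (that line does appear in the logs below, though the actual patch may equally rely on the leaf check succeeding first in the container case); all other names are made up:

```c++
#include <string>

// From the sketch above.
std::string lowest_limit_path(const std::string& mount_root,
                              std::string cgroup_path,
                              const char* limit_file);

// Assumed query; compare the is_containerized() line in the logs below.
bool is_containerized();

struct ControllerPaths {
  std::string memory_path;
  std::string cpu_path;
};

// Runs once at subsystem initialization; the results are cached on the
// assumption that the process' cgroup path does not change afterwards.
// File names shown are the cgroup v1 ones (v2 uses memory.max / cpu.max).
ControllerPaths init_controller_paths(const std::string& memory_root,
                                      const std::string& cpu_root,
                                      const std::string& cgroup_path) {
  ControllerPaths p{memory_root + cgroup_path, cpu_root + cgroup_path};
  if (!is_containerized()) { // container engines set limits at the leaf
    p.memory_path = lowest_limit_path(memory_root, cgroup_path,
                                      "memory.limit_in_bytes");
    p.cpu_path    = lowest_limit_path(cpu_root, cgroup_path,
                                      "cpu.cfs_quota_us");
  }
  return p;
}
```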
Example output for a systemd slice on cgroup v1:
[0.000s][trace][os,container] OSContainer::init: Initializing Container Support
[0.000s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups hybrid or legacy hierarchy, using cgroups v1 controllers
[0.001s][trace][os,container] Adjusting controller path for memory: /sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope
[0.001s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.limit_in_bytes
[0.001s][trace][os,container] Memory Limit is: 9223372036854771712
[0.001s][debug][os,container] container memory limit ignored: 9223372036854771712, using host value 67163885568
[0.001s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/memory.limit_in_bytes
[0.001s][trace][os,container] Memory Limit is: 9223372036854771712
[0.001s][debug][os,container] container memory limit ignored: 9223372036854771712, using host value 67163885568
[0.001s][trace][os,container] Path to /memory.limit_in_bytes is /sys/fs/cgroup/memory/user.slice/user-cg.slice/memory.limit_in_bytes
[0.001s][trace][os,container] Memory Limit is: 4294967296
[0.001s][trace][os,container] Adjusted controller path for memory to: /sys/fs/cgroup/memory/user.slice/user-cg.slice
[0.001s][trace][os,container] Adjusting controller path for cpu: /sys/fs/cgroup/cpu,cpuacct/user.slice/user-cg.slice/user-cg-cpu.slice
[0.001s][trace][os,container] Path to /cpu.cfs_quota_us is /sys/fs/cgroup/cpu,cpuacct/user.slice/user-cg.slice/user-cg-cpu.slice/cpu.cfs_quota_us
[0.001s][trace][os,container] CPU Quota is: 600000
[0.001s][trace][os,container] Path to /cpu.cfs_period_us is /sys/fs/cgroup/cpu,cpuacct/user.slice/user-cg.slice/user-cg-cpu.slice/cpu.cfs_period_us
[0.001s][trace][os,container] CPU Period is: 100000
[0.001s][trace][os,container] CPU Quota count based on quota/period: 6
[0.001s][trace][os,container] OSContainer::active_processor_count: 6
[0.001s][trace][os,container] Lowest limit for cpu at leaf: /sys/fs/cgroup/cpu,cpuacct/user.slice/user-cg.slice/user-cg-cpu.slice
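The `CPU Quota count based on quota/period: 6` line above is a ceiling division of quota by period (600000 / 100000 = 6). A small sketch of that computation, with illustrative names (Hotspot's actual helper differs):

```c++
#include <cmath>

// Returns the CPU count implied by a CFS quota, or -1 when no quota is set
// (the caller then falls back to other limits or the host CPU count).
int quota_based_cpu_count(long quota_us, long period_us) {
  if (quota_us <= 0 || period_us <= 0) {
    return -1; // -1 is the conventional "no limit" value in the cgroup files
  }
  // Round up so that e.g. a quota of 150000 us over a 100000 us period
  // still yields 2 usable CPUs.
  return (int)std::ceil((double)quota_us / (double)period_us);
}
```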
Example output when running in a container (podman) on cg v2:
[0.000s][trace][os,container] OSContainer::init: Initializing Container Support
[0.001s][debug][os,container] Detected optional pids controller entry in /proc/cgroups
[0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
[0.001s][debug][os,container] OSContainer::init: is_containerized() = true because all controllers are mounted read-only (container case)
[0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/cpu.max
[0.001s][trace][os,container] CPU Quota is: 100000
[0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup/cpu.max
[0.001s][trace][os,container] CPU Period is: 100000
[0.001s][trace][os,container] CPU Quota count based on quota/period: 1
[0.001s][trace][os,container] OSContainer::active_processor_count: 1
[0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 1
[0.001s][trace][os,container] total physical memory: 6207631360
[0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup/memory.max
[0.001s][trace][os,container] Memory Limit is: 524288000
[0.001s][trace][os,container] Memory Limit is: 524288000
**Testing:**
- [ ] GHA
- [x] Linux container tests in `test/hotspot/jtreg/containers` on cg v1 and cg v2 as well as relevant cgroup gtests.
- [x] Manual testing on systemd slices with cpu/memory limits.
- [x] Specific automated testing as proposed in [JDK-8333446](https://bugs.openjdk.org/browse/JDK-8333446).
Thoughts? Opinions?
-------------
Commit messages:
- Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
- Remove some duplication
- Fix style
- Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
- Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
- Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
- 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected
Changes: https://git.openjdk.org/jdk/pull/20646/files
Webrev: https://webrevs.openjdk.org/?repo=jdk&pr=20646&range=00
Issue: https://bugs.openjdk.org/browse/JDK-8322420
Stats: 363 lines in 9 files changed: 280 ins; 66 del; 17 mod
Patch: https://git.openjdk.org/jdk/pull/20646.diff
Fetch: git fetch https://git.openjdk.org/jdk.git pull/20646/head:pull/20646
PR: https://git.openjdk.org/jdk/pull/20646