RFR: 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected [v2]
Thomas Stuefe
stuefe at openjdk.org
Tue Aug 27 14:31:12 UTC 2024
On Tue, 27 Aug 2024 10:14:15 GMT, Severin Gehwolf <sgehwolf at openjdk.org> wrote:
>> Please review this Linux container detection improvement which allows limits being detected even if they are not exposed at the leaf nodes. So far this is only observable on systemd slices on cgroup v2. For cgroup v1 this has been addressed with [JDK-8217338](https://bugs.openjdk.org/browse/JDK-8217338) in a version specific way. This patch proposes to address the problem a different way. Instead of only looking at the determined cgroup path for the interface files, we iterate the hierarchy up to its root and stop as soon as we have determined any limit at a given path since it's best practise to not set any higher limit lower down the hierarchy (except for the default of unset/max).
>>
>> Consider this subsystem path:
>>
>>
>> /sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope
>>
>>
>> with a root of `/sys/fs/cgroup/memory` and a cgroup path of `/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope`. Then prior this patch we only looked at `/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.max` on cgroup v2 systems for the limit and at `/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.limit_in_bytes` on cgroup v1 systems. On cgroup v1 we also looked at `/sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.stat` looking for the `hierarchical_memory_limit` key in there if the original look-up of the limit in `memory.limit_in_bytes` file returned no limit. However, the `hierarchical_memory_limit` info is cgroup v1 specific and not present on cg v2's `memory.stat` files.
>>
>> This patch addresses this problem in a uniform way by walking the cgroup path up to the root looking up any limit, solving the problem that got addressed version specific with JDK-8217338 at the time as well as addressing the problem on cgroup v2. As soon as any limit is being found it uses that path for the specific controller. That is on cg v1 the following series of paths are being looked at in that order (provided there is no limit set, thus processing doesn't stop early):
>>
>>
>> /sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/run-r634adce2617145ea9660623c335cb3db.scope/memory.limit_in_bytes
>> /sys/fs/cgroup/memory/user.slice/user-cg.slice/user-cg-cpu.slice/memory.limit_in_bytes
>> /sy...
>
> Severin Gehwolf has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>
> - Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
> - Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
> - Remove some duplication
> - Fix style
> - Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
> - Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
> - Merge branch 'master' into jdk-8322420_cgroup_hierarchy_walk_init
> - 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected
Hi @jerboaa,
Had a look. I am no expert in these things, so this concentrates more on the mechanical parts and on readability.
Cheers, Thomas
src/hotspot/os/linux/cgroupUtil_linux.cpp line 58:
> 56: log_trace(os, container)("Adjusting controller path for memory: %s", mem->subsystem_path());
> 57: assert(mem->cgroup_path() != nullptr, "invariant");
> 58: char* orig = os::strdup(mem->cgroup_path());
Why do we strdup orig?
Ah, I see, because we need to modify the controller object because the only way to get the new limit value is to ask it, and so it needs to be set to the new path, but that new path may turn out to be wrong, so we need to restore the original path. And we cannot just create a new controller on the fly to check the limit because all we have here is the abstract `CgroupMemoryController` interface, and we don't know if we deal with V1 or V2 here. Sigh...
src/hotspot/os/linux/cgroupUtil_linux.cpp line 64:
> 62: jlong limit = mem->read_memory_limit_in_bytes(phys_mem);
> 63: bool path_iterated = false;
> 64: while (limit < 0 && (last_slash = strrchr(cg_path, '/')) != cg_path) {
Is it guaranteed that cg_path starts with / ? otherwise, you have a write-to-NULL below. maybe assert that `cg_path[0] == '/'`
src/hotspot/os/linux/cgroupUtil_linux.cpp line 67:
> 65: *last_slash = '\0'; // strip path
> 66: // update to shortened path and try again
> 67: mem->set_subsystem_path(cg_path);
Does this work with zero-length strings? If all that was left was "/" and we just stripped that away?
src/hotspot/os/linux/cgroupUtil_linux.cpp line 71:
> 69: path_iterated = true;
> 70: if (limit > 0) {
> 71: log_trace(os, container)("Adjusted controller path for memory to: %s", mem->subsystem_path());
Please trace out the found limit as well
src/hotspot/os/linux/cgroupUtil_linux.cpp line 80:
> 78: os::free(cg_path);
> 79: if (path_iterated) {
> 80: mem->set_subsystem_path((char*)"/");
set_subsystem_path should really take a const char*, and this cast should not be necessary.
src/hotspot/os/linux/cgroupUtil_linux.hpp line 40:
> 38: // Iterate over the cpu controller hierarchy adjusting the path to the
> 39: // smallest observable limit (if any)
> 40: static CgroupCpuController* adjust_controller(CgroupCpuController* c);
We never return a new controller, but always the old one. I think I would simplify and clarify the API and return void. I would not allow for API chaining here, the memory ownerships are difficult enough to follow as it is. Could also return boolean for success=limit found. And comment should make it clear we operate on the given controller.
Proposal:
Given a memory controller, adjust its path to a point in the hierarchy
that represents the closest memory limit. Returns true if such limit was
found, false if not.
src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 41:
> 39: * on the contents of the mountinfo and cgroup files.
> 40: */
> 41: void CgroupV1Controller::set_subsystem_path(char *cgroup_path) {
make argument const char*
src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 42:
> 40: */
> 41: void CgroupV1Controller::set_subsystem_path(char *cgroup_path) {
> 42: if (_cgroup_path != nullptr) {
Strictly speaking the is-not-null condition is not necessary when you free stuff. nullptr are ignored on free.
src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 46:
> 44: }
> 45: if (_path != nullptr) {
> 46: os::free(_path);
Please set _path to nullptr, the dangling pointer makes me nervous.
src/hotspot/os/linux/cgroupV1Subsystem_linux.cpp line 78:
> 76: bool CgroupV1Controller::needs_hierarchy_adjustment() {
> 77: assert(_cgroup_path != nullptr, "sanity");
> 78: return strcmp(_root, _cgroup_path) != 0;
A comment maybe explaining the logic?
src/hotspot/os/linux/cgroupV1Subsystem_linux.hpp line 181:
> 179: _cpuset(cpuset),
> 180: _cpu(new CachingCgroupController<CgroupCpuController>(
> 181: CgroupUtil::adjust_controller(cpu))),
This should go into the cpp file, its deep implementation stuff.
src/hotspot/os/linux/cgroupV2Subsystem_linux.hpp line 52:
> 50: _path(o._path) {
> 51: _cgroup_path = o._cgroup_path;
> 52: _mount_point = o._mount_point;
I think the constructors should move to the C++ file.
src/hotspot/os/linux/cgroupV2Subsystem_linux.hpp line 142:
> 140: CgroupUtil::adjust_controller(memory))),
> 141: _cpu(new CachingCgroupController<CgroupCpuController>(
> 142: CgroupUtil::adjust_controller(cpu))) {
I would move this to the C++ file.
src/hotspot/os/linux/os_linux.hpp line 34:
> 32: class os::Linux {
> 33: friend class CgroupSubsystem;
> 34: friend class CgroupUtil;
why?
-------------
Changes requested by stuefe (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/20646#pullrequestreview-2263415888
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732921540
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732924128
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732927774
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732920679
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732933505
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732960773
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732930636
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732917524
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732916582
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732965090
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732881840
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732967369
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732970172
PR Review Comment: https://git.openjdk.org/jdk/pull/20646#discussion_r1732953424
More information about the hotspot-runtime-dev
mailing list