RFR: 8322420: [Linux] cgroup v2: Limits in parent nested control groups are not detected [v2]

Jan Kratochvil jkratochvil at openjdk.org
Tue Feb 20 15:15:09 UTC 2024


On Wed, 31 Jan 2024 15:33:01 GMT, Jan Kratochvil <jkratochvil at openjdk.org> wrote:

>> It looks like this patch needs a bit more discussion. Generally, it would be good to have the functionality, but I'm unsure about the proposed implementation.
>> 
>> Walking the directory hierarchy (**and** interface files) on every call of `physical_memory()` (or `active_processor_count()` for CPU) seems concerning. Since it seems unlikely for cgroup interface file paths changing during runtime, walking the hierarchy for every look-up seems overkill. Note that the prime users of this feature are in-container use, where most runtimes have the limit settings at the leaf. A few more comments below:
>> 
>> - After this patch, AFAIUI for a cgroup path like `/foo/bar/baz/memory.max` the following files are being looked at for the memory limit (for example for the interface file `memory.max`):
>>   ```
>>    /foo/bar/baz/memory.max
>>    /foo/bar/memory.max
>>    /foo/memory.max
>>   ```
>>   This used to be just one file at the leaf, `/foo/bar/baz/memory.max` (prior this patch). So this change has an effect on file IO. `N` files per metric to look at where it used to be one file (i.e. constant). Note that switches like `UseDynamicNumberOfCompilerThreads` make this particularly worrying.  I wonder if it would be sufficient to "lock" into the cgroup path where the lowest  limits are set in the hierarchy and only use the interface files at that path of a specific controller. This would mean adjusting `CgroupsV2Subsystem` similar to `CgroupsV1Subsystem` to have unique controller paths (i.e. `cpu` and `memory` controller paths could potentially be different). But the code would already be mostly there for this. The idea would be to do the initial walk of the tree at JVM startup, and from then on, only look at the path with a limit set (if any) for subsequent calls. That is `controller->subsystem_path()` would return the correct path at runtime when initialization is done
 . Thoughts?
>> - There seems to be an inconsistency between cgroups v1 (no recursive look-up) and cgroups v2  (recursive look-up). Why? I think we should employ the same semantics to both  versions (and drop the hierarchical work-around - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg v1).
>> - There also seems to be an inconsistency between metrics being iterated. `memory.max` and `memory.swap.max`  and `memory.swap.current` are being iterated, others, like CPU quotas (`cpu.max` or
>>   `cpu.weight`) not. Why? Both limits could potentially be one path up the hierarchy, meaning that cp...
>
>> Walking the directory hierarchy (**and** interface files) on every call of `physical_memory()` (or `active_processor_count()` for CPU) seems concerning.
> 
> I have therefore implemented a hierarchical `memory.stat` extension for cgroup2 which already exists for cgroup1 - if the patch is agreed upon I will submit it to Linux kernel: [cgroup2-hierarchical_memory_limit.patch.txt](https://github.com/openjdk/jdk/files/14113452/cgroup2-hierarchical_memory_limit.patch.txt)
> 
>> Since it seems unlikely for cgroup interface file paths changing during runtime, walking the hierarchy for every look-up seems overkill. Note that the prime users of this feature are in-container use, where most runtimes have the limit settings at the leaf.
> 
> While I understand the most common change of the limits is at the leaf technically any parent group limit can change. Which is also what this patch is implementing. For the performance issue I have implemented the Linux kernel extension above.
> 
>> `N` files per metric to look at where it used to be one file (i.e. constant).
> 
> The problem is any of the files can change. But to fix the performance I have implemented the Linux kernel patch above.
> 
>> I wonder if it would be sufficient to "lock" into the cgroup path where the lowest  limits are set in the hierarchy and only use the interface files at that path of a specific controller.
> 
> That mostly disables the functionality of this patch.
> 
> 
>>     * There seems to be an inconsistency between cgroups v1 (no recursive look-up) and cgroups v2  (recursive look-up). Why? I think we should employ the same semantics to both  versions (and drop the hierarchical work-around - [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338) - for cg v1).
> 
> I did not much expect anyone still uses cgroup1. Plus customer was also interested in cgroup2. AFAIK a last OS using cgroup1 was RHEL-7 which will have EOL in a few months. But then I tried to implement cgroup1 but there it sort of already worked thanks for your [JDK-8217338](https://bugs.openjdk.java.net/browse/JDK-8217338). I just fixed there to prefer the hierarchical limit over the leaf limit (and the testcase does test it now) as that is what is used by Linux kernel.
> 
> And then my Linux kernel patch implements the same `memory.stat` way even for cgroup2.
> 
>>     * There also seems to be an inconsistency between metrics being iterated. `memory.max` and `memory.swap.max`  and `memory.swap.current` are being iterated, others, like CPU quotas (`cpu.max` or...

> @jankratochvil The "use hierarchical limit" was a work-around that bites us now.

What is the current problem? I see one problem of the cgroup1 hierarchical implementation in OpenJDK which is fixed by this patch incl. a testcase.


> So instead we should attempt to unify cgv1 and cgv2 by walking the hierarchy

When you are concerned about the performance I have rather implemented the hierarchical interface also for Linux kernel.
  https://lore.kernel.org/linux-mm/ZcvlhOZ4VBEX9raZ@host1.jankratochvil.net/T/
It has been welcomed by SuSE:
  https://lore.kernel.org/linux-mm/8f8d5a2d-dde3-42e5-9988-fab042666f40@redhat.com/T/#m1238300cf818badebed0183f1b5ab798bf1f2e9f


> and find the place in the hierarchy where the interface files - with actual limits imposed - are to be found.

That is being done if the patch above is not found on the running kernel.


> Since this patch attempts to solve a similar goal to what the old work-around solved for cg v1, we should re-think how to solve it properly. My suggestion would be to go with walking the hierarchy for both: cg v1 and cg v2.

I hope I have fixed the same problem in a more performance wise way.


> As to the walking the hierarchy on every invocation problem, I don't think fixing it by relying on a kernel patch is the way to go.

Why not to support both ways. Time flies like wind.


> It'll take time for the ecosystem to pick those up and existing cg v2 systems won't be fixed (consider RHEL 8 or RHEL 9 systems and their derivatives).

Particuarly for RHEL it depends whether any of the paying customers is interested in the performance wise solution (I doubt so as the performance difference is small). Backport for a hotfix delivery for the specific customer will be done in a few days. Included in the next RHEL y-release it will be in a few months (and also released then in CentOS).


> Therefore, the idea would be to do the iteration **once** when the `OSContainer` code initializes.

Done that in this patch. I think it could also occasionally re-read the hierarchical structure but that hasn't been implemented.

I expect I would remove the "memory.max.effective" dependency from this patch for the OpenJDK commit as the kernel patch has not been approved yet and so the kernel interface ("memory.max.effective") still may change.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17198#issuecomment-1954420583


More information about the core-libs-dev mailing list