[PING] RFR: 8231111: Cgroups v2: Rework Metrics in java.base so as to recognize unified hierarchy

Bob Vandette bob.vandette at oracle.com
Mon Dec 2 21:06:30 UTC 2019



> On Dec 2, 2019, at 3:33 PM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
> 
> On Mon, 2019-12-02 at 14:26 -0500, Bob Vandette wrote:
>>> On Dec 2, 2019, at 2:03 PM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>>> 
>>> Hi Bob,
>>> 
>>> On Mon, 2019-12-02 at 12:13 -0500, Bob Vandette wrote:
>>>> Sorry for the delay in responding.  See comments below:
>>> 
>>> Thanks very much for taking a look at this!
>>> 
>>>>> On Nov 29, 2019, at 4:05 AM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>>>>> 
>>>>> On Fri, 2019-11-15 at 17:51 +0100, Severin Gehwolf wrote:
>>>>>> Hi,
>>>>>> 
>>>>>> Could I please get reviews of these core-libs, Linux-only, changes to
>>>>>> the Metrics subsystem? This patch implements cgroupv2 support for
>>>>>> Metrics which are currently cgroupv1-only. Fedora 31 switched to
>>>>>> cgroupv2 by default so it's time to get OpenJDK recognize it.
>>>>>> 
>>>>>> Note that a couple of metrics are no longer supported with cgroupv2.
>>>>>> Most notably (not an exhaustive list, though):
>>>>>> 
>>>>>> Metrics.getKernel*() family of methods.
>>>>>> Metrics.getTcp*() family of methods.
>>>>>> Metrics.getBlkIO*() family of methods.
>>>>>> Metrics.isMemoryOOMKillEnabled()
>>>>>> 
>>>>>> A couple of open questions with regards to that:
>>>>>> 
>>>>>> 1)
>>>>>> Most API docs of Metrics make no distiction between "unlimited" and
>>>>>> "not supported", both returning -1 for longs, for example. This is a
>>>>>> problem, because output of "java -XshowSettings:system -version" will
>>>>>> not distinguish between unlimited and not supported metrics. Would it
>>>>>> be acceptable to change the API to distinguish those cases so that
>>>>>> LauncherHelper could display them appropriately?
>>>>>> 
>>>>>> 2)
>>>>>> How should we deal with "not supported" for booleans/arrays, etc.?
>>>>>> Would it make sense to return record objects from the Metrics API so
>>>>>> that this could be dealt with? E.g.
>>>>>> 
>>>>>> Metrics m = ...
>>>>>> MetricResult<int[]> result = m.getCpuSetCpus();
>>>>>> switch(result.getType()) {
>>>>>> case NOT_SUPPORTED: /* do something */; break;
>>>>>> case SUPPORTED: int[] val = result.get(); break;
>>>>>> ...         
>>>>>> }
>>>>>> 
>>>>>> I'm bringing this up, because the proposed patch doesn't deal with the
>>>>>> above open questions as of yet. With that being said, it's mostly
>>>>>> identical to the proposed hotspot changes in [1].
>>>> 
>>>> For issue 1 and 2 …
>>>> 
>>>> I wonder if we should change the API to throw UnsupportedOperationException for those methods 
>>>> that are not available.  This exception is used quite a lot in the java/nio and java/net packages
>>>> for dealing with functionality not available on a host platform.
>>>> As an alternate suggestion, we already return -2 for "not supported" for most APIs in the hotspot
>>>> implementation.  We could use this for non boolean values in the Metrics API.  For boolean
>>>> values, we could change the API to return “Boolean”.  A null return would signify not
>>>> implemented.
>>> 
>>> I'd be fine with either way. We just need to pick one. Should I go
>>> ahead with UnsupportedOperationException?
>> 
>> The change to use -2 and Boolean is less intrusive and more consistent with the hotspot
>> implementation, so I’d pick that one.
> 
> OK. I'm travelling currently, so expect a bit of delay for the update.

No prob.

> 
>>>>>> Bug: https://bugs.openjdk.java.net/browse/JDK-8231111
>>>>>> webrev: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/04/webrev/
>>>>>> 
>>>> 
>>>> http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/04/webrev/src/java.base/linux/classes/jdk/internal/platform/CgroupSubsystemFactory.java.html
>>>> 
>>>> Should we check to make sure that there are no mixed cgroupv1 and cgroupv2 mounted subsystems since
>>>> you are not supporting mixing.
>>> 
>>> I'm not sure what you mean. The kernel won't allow for controllers to
>>> be used by cgroupv1 *and* cgroupv2 at the same time. Each controller
>>> would have to be assigned to a control hierarchy (v1 or v2)[1]. If you
>>> mean hybrid, that's basically cgroups v1. The factory will look at the
>>> system and detect the version in use: cgroups v1 xor cgroups v2.
>> 
>> The memory subsystem for example can’t be assigned to both v1 and v2 controllers but the
>> memory subsystem can be assigned to v1 and the cpu subsystem can be assigned to
>> v2, right?
>> 
>> That’s the case I’m concerned about. 
> 
> I see. What exactly would you want to happen if mixing is being
> detected? Should it return null (no metrics)? I can do that.

It would be good to provide a warning and return null.

> 
> TBH, I'm not sure what would happen currently (prior this patch). There
> seems to be some inconsistency how hotspot/core libs are dealing with
> this. For hotspot, if any one of the controllers is not available, no
> controller is being used. That's different to how core libs works right
> now.
> 
> Any thoughts about this?

In order for Hotspot to implement container ergonomics it needs memory and the cpu* subsystems.
The Metrics methods are merely informative so it can continue with as many subsystems it can find as
long as the others return not available.

> 
>>>> http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/04/webrev/src/java.base/linux/classes/jdk/internal/platform/cgroupv2/CgroupV2Subsystem.java.html
>>>> 
>>>> It looks looks like there may be alternate ways of reporting some of the metrics that you have classified as RETVAL_NOT_SUPPORTED.
>>>> 
>>>> BlkIOServicedCount (io.stat)
>>>> KernelMemory (memory.stat)
>> I was thinking of memory.stat (file+kernel_stack+slab) assuming these
>> aren’t account for in user memory totals.
> 
> It's not clear whether they are. So this seems to be getting into some
> guesswork, which I'd rather avoid. The memory controller mentions this:
> 
> """"
> While not completely water-tight, all major memory usages by a given
> cgroup are tracked so that the total memory consumption can be
> accounted and controlled to a reasonable extent.  Currently, the
> following types of memory usages are tracked.
> 
> - Userland memory - page cache and anonymous memory.
> 
> - Kernel data structures such as dentries and inodes.
> 
> - TCP socket buffers.
> 
> The above list may expand in the future for better coverage.
> """"
> 
> Given the above and the description of 'slab' for memory.stat it's not
> clear there is overlap.

Rather than guessing, maybe some kernel source exploration might 
confirm the right files to use.  Are there any kernel guys you know that
can help?

It seems pretty clear from the description that that kernel_stack and slab are kernel specific.

> 
>>>> TcpMemory (memory.stat)
>> I was thinking of memory.stat (sock)
> 
> OK. Sure.
> 
>>> Could you be more specific, please? What exaclty do you have in mind
>>> for KernelMemory from memory.stat? Same for TcpMemory.
>>> 
>>> For BlkIOServicedCount do you mean calculating the sum of all rios+wios
>>> for any device? Something else?
>> Yes, that’s what I had in mind but since it’s a single value for all I/O, you’d have to
>> sum up all devices.
> 
> OK. Let's do that then.
> 
>>>> You could try the same trick for getMemoryAndSwapMaxUsage which keeps track of the highest returned value.
>>> 
>>> Sure, will do if you insist. I'll note that this is a hack and if
>>> getMemoryUsage() is never called, it will be the same as
>>> getMemoryUsage() - provided I'll add a call to getMemoryUsage() from
>>> getMemoryMaxUsage(). Either way, the value returned is most likely
>>> wrong and I'm not a huge fan of it, TBH.
>> 
>> I agree it’s a hack so let’s just make both unsupported.
> 
> Sounds good!
> 
>>>> for the benefit of other reviewers, you should provide links to the cgroupv1 and v2 docs.
>>>> 
>>>> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>>>> 
>>>>>> Testing: jdk/submit and platform docker tests on Linux x86_64 (with
>>>>>> hybrid hierarchy, docker/podman) and on Linux x86_64 with unified
>>>>>> hierarchy (podman only).
>>>>>> 
>>>>>> Thoughts? Suggestions?
>>>> 
>>>> Do you think we should check the docker version being used for the tests to make sure it
>>>> supports cgroupv2?  I assume a fairly recent version is required to work with cgroupv2.
>>> 
>>> I don't think that will be needed. The container engine won't start if
>>> cgroups v2 isn't supported. So we can be sure that if docker is
>>> running, it's either going to be on cgroups v2 or cgroups v1. In both
>>> cases the right set of tests will run. Does that make sense?
>> 
>> I’m just thinking of a situation where a kernel is configured to use cgroupv2
>> and a developer wants to run the jtreg tests.  It would be good to skip the
>> container tests if docker can’t run on it.  We have a check to see if docker can
>> run before running docker tests “using docker ps" but AFAIK, it doesn’t test starting
>> a container or verify the version.  
> 
> Right. My thinking is that "using docker ps" won't work on cgroups v2
> if docker isn't supporting it in the first place. There is also some
> logic to skip the test if building the container fails:
> http://hg.openjdk.java.net/jdk/jdk/file/d8b5e32ffa2f/test/lib/jdk/test/lib/containers/docker/DockerTestUtils.java#l178
> 
> Assuming that docker builds pass again gives one reasonable assurance
> that cgroups (whatever version) is supported. I'm not a great fan to
> have some arbitrary docker version check which would be useless in the
> podman case anyway. In all likelihood it would be covered by current
> logic anyway.
> 
> More thoughts?

It looks like docker still can’t handle cgroups v2 so I take back my suggestion.
I thought they’d be further along by now.
It looks like Docker and Kubernetes are going to have early access  support for
cgroupsv2 early next year so we can take a look at doing some validation at that time.

https://medium.com/nttlabs/cgroup-v2-596d035be4d7

Bob.


> 
> Thanks,
> Severin
> 
>> Bob.
>> 
>> 
>>> Thanks,
>>> Severin
>>> 
>>>> Bob.
>>> 
>>> [1] https://lists.freedesktop.org/archives/systemd-devel/2017-November/039754.html
>>>> 
>>>>> Ping?
>>>>> 
>>>>>> Thanks,
>>>>>> Severin
>>>>>> 
>>>>>> [1] http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-November/039909.html
> 



More information about the core-libs-dev mailing list