[PING] RFR: 8231111: Cgroups v2: Rework Metrics in java.base so as to recognize unified hierarchy
Bob Vandette
bob.vandette at oracle.com
Tue Feb 11 21:59:38 UTC 2020
I applied your patch to the latest JDK 15 sources and ran the container
tests on Oracle Linux 8.1 with podman/cgroupv2 enabled. There were some issues.
I’m not sure if its my setup or not.
I also ran the same build on Ubuntu with docker/cgroupv1. I didn't see any failures on
the cgroupv1 system.
Here are some notes:
The podman version on OL 8.1 doesn't yes support cgroupv2. I
built the latest from sources for this test.
The docker tests take a very long time under podman! Longer than
the cgroupv1 run.
cpusets and cpusets.mems are blank on host and if none are specified
on podman/docker run command. On cgroupv1 they are host values if not
specified for container. Effective cpusets and cpusets.mems are set properly
in a container.
HOST OUTPUT:
./java -XshowSettings:system -version
Operating System Metrics:
Provider: cgroupv2
Effective CPU Count: 32
CPU Period: -1
CPU Quota: -1
CPU Shares: -1
List of Processors: N/A
List of Effective Processors: N/A
List of Memory Nodes: N/A
List of Available Memory Nodes: N/A
Memory Limit: Unlimited
Memory Soft Limit: Unlimited
Memory & Swap Limit: Unlimited
CONTAINER OUTPUT:
# podman run -it -v `pwd`:/mnt ubuntu bash
root at 3c6654a3b834:/mnt/jdk/bin# ./java -XshowSettings:system -version
Operating System Metrics:
Provider: cgroupv2
Effective CPU Count: 32
CPU Period: 100000us
CPU Quota: -1
CPU Shares: -1
List of Processors: N/A
List of Effective Processors, 32 total:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
List of Memory Nodes: N/A
List of Available Memory Nodes, 1 total:
0
Memory Limit: Unlimited
Memory Soft Limit: Unlimited
Memory & Swap Limit: Unlimited
Docker tests fail if /bin/docker is not available in podman setup. We
probably should enhance the docker check to also look for podman.
Two container tests failed:
FAILED: containers/cgroup/PlainRead.java failed Memory Limit is: -2 instead
of unlimited or -1. This is because memory.max is not foumd.
FAILED: jdk/internal/platform/cgroup/TestCgroupMetrics.java
This fails because nr_periods line doesn't always exist. I think you’ve got
to enable a quota for this to appear (not sure).
Here’s the contents:
% more cpu.stat
usage_usec 23974562755
user_usec 22257183568
system_usec 1717379186
CGROUPV2 results on Oracle Linux 8.1
---------
testing container APIs
FAILED: containers/cgroup/PlainRead.java
Passed: containers/docker/DockerBasicTest.java
Passed: containers/docker/TestCPUAwareness.java
Passed: containers/docker/TestCPUSets.java
Passed: containers/docker/TestJcmdWithSideCar.java
Passed: containers/docker/TestJFREvents.java
Passed: containers/docker/TestJFRNetworkEvents.java
Passed: containers/docker/TestMemoryAwareness.java
Passed: containers/docker/TestMisc.java
Test results: passed: 8; failed: 1
Results written to /export/users/bobv/jdk15/build/jtreg/JTwork
Error: Some tests failed or other problems occurred.
testing jdk.internal.platform APIs
FAILED: jdk/internal/platform/cgroup/TestCgroupMetrics.java
Passed: jdk/internal/platform/cgroup/TestCgroupSubsystemController.java
Passed: jdk/internal/platform/docker/TestDockerCpuMetrics.java
Passed: jdk/internal/platform/docker/TestDockerMemoryMetrics.java
Passed: jdk/internal/platform/docker/TestSystemMetrics.java
Test results: passed: 4; failed: 1
Results written to /export/users/bobv/jdk15/build/jtreg/JTwork
Error: Some tests failed or other problems occurred.
testing -XshowSettings:system launcher option
Passed: tools/launcher/Settings.java
Test results: passed: 1
Bob.
> On Feb 11, 2020, at 1:04 PM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>
> Hi Mandy, Bob,
>
> Thanks again for the reviews and patience on this. Sorry it took me so
> long to get back to this :-/
>
> Updated webrev:
> Full: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/10/webrev/
> incremental: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/10/incremental/webrev/
>
> I've tested this with docker tests on cgroup v1 and via podman on a
> cgroup v2 system. They pass. I'll be running this through jdk-submit as
> well.
>
> More below.
>
> On Tue, 2020-01-21 at 16:09 -0800, Mandy Chung wrote:
>> Hi Severin,
>>
>> Thanks for the update.
>>
>> On 1/21/20 11:30 AM, Severin Gehwolf wrote:
>>> Full: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/09/webrev/
>>> incremental: http://cr.openjdk.java.net/~sgehwolf/webrevs/JDK-8231111/09/incremental/webrev/
>>>
>>
>> I have answered my own question. Most of the metrics used to return 0 if unavailable due to IOException reading a file or malformed content and now are changed to return -2 due to error fetching the metric.
>>
>> The following are about limits which used to return -1 if unlimited or no limit set.
>> public long getCpuQuota();
>> public long getCpuShares();
>> public long getMemoryLimit();
>> public long getMemoryAndSwapLimit();
>> public long getMemorySoftLimit();
>>
>> With this patch, only getMemoryLimit and getMemoryAndSwapLimit specify to return -1 if unlimited or no limit set. However the implementation does return -1. All of the above specify to return -2 if unavailable due to error fetching the metric.
>>
>> I found the implementation quite hard to follow. I spent some time reviewing the code to see if the implementation matches the spec but I can't easily tell yet. For example,
>> CgroupSubsystemController::getLongValueMatchingLine returns -1 when IOException occurs.
>> CgroupSubsystemController::getLongEntry returns 0L if IOException occurs.
>>
>> CgroupV1SubsystemController::convertStringToLong returns Long.MAX_VALUE
>> when value overflows
>
> This one is intentional. It's mapped back to unlimited via
> longValOrUnlimited(). The reason for this is that cgroup v1 doesn't
> have a concept of "unlimited". Unlimited values will be a very large
> numbers in cgroup v1 files.
>
>> CgroupV2SubsystemController::convertStringToLong returns -1 when IOException occurs
>>
>> CgroupV1Subsystem::getCpuShares return -1 if cpu.shares == 0 or 1024
>> CgroupV2Subsystem::getCpuShares returns -1 if cpu.weight == 100 or 0
>
> These two are special cases too. See the implementation note of
> Metrics.getCpuShares(). In the cgroup v2 case the default value is 100
> (over 1024 in cgroup v1). That's why unlimited is being returned for
> those values.
>
>> CgroupV2Subsystem::getFromCpuMax returns -1 if error in reading cpu.max or malformed content
>> CgroupV2Subsystem::sumTokensIOStat returns -2 if IOException error
>> This is called by getBlkIOServiceCount and getBlkIOServiced
>>
>> I think this can be improved and add the documentation to describe
>> what the methods do. Since Metrics APIs consistently return -2 if
>> unavailable due to error in fetching the metric, why some utility
>> methods in *Subsystem and *SubsystemController return -1 upon error
>> and 0 when unlimited?
>>
>> I suspect if the getXXXValue and other methods are clearly documented
>> with the error cases (possibly renaming the method name if appropriate)
>> CgroupV1Subsystem and CgroupV2SubSystem will become very explicit
>> to understand.
>
> This should be fixed now.
>
> I've gone through the API doc of Metrics.java and have updated it. In
> general, I've updated it to return -1 if metric is unavailable (due to
> error in reading some files or content being empty), and -2 if not
> supported. No method returns -2 currently, but it might change and it's
> good to have some way of saying "not implementable" for this subsystem
> in the spec. That's my take on it anyway.
>
> There is also a new unit test for shared controller logic:
> TestCgroupSubsystemController.java
>
> It execises various cases of error/success.
>
> That is to ensure proper symmetry across the various cases (including
> IOException). I've also documented static methods in
> CgroupSubsystemController. Overall, all methods now return the same
> values for cgroup v1 and cgroup v2 (given the impl nuances) for the
> various cases.
>
>>
>> CgroupSubsystem.java
>>
>> 44 public static final double DOUBLE_RETVAL_NOT_SUPPORTED = LONG_RETVAL_NOT_SUPPORTED;
>> 49 public static final Boolean BOOL_RETVAL_NOT_SUPPORTED = null;
>>
>> They are no longer needed, right?
>
> Removed.
>
>>
>> CgroupSubsystemFactory.java
>>
>> 89 System.err.println("Warning: Mixed cgroupv1 and cgroupv2 not supported. Metrics disabled.");
>>
>>
>> I expect this be a System.Logger log
>
> Updated.
>
>> 114 if (!Integer.valueOf(0).toString().equals(tokens[0])) {
>>
>> This can be simplified to if (!"0".equals(tokens[0]))
>
> Done, thanks!
>
>> LauncherHelper.java
>>
>> 407 // Extended cgroupv1 specific metrics
>> 408 if (c instanceof MetricsCgroupV1) {
>> 409 MetricsCgroupV1 cgroupV1 = (MetricsCgroupV1)c;
>> 410 limit = cgroupV1.getKernelMemoryLimit();
>> 411 ostream.println(formatLimitString(limit, INDENT + "Kernel Memory Limit: "));
>> 412 limit = cgroupV1.getTcpMemoryLimit();
>> 413 ostream.println(formatLimitString(limit, INDENT + "TCP Memory Limit: "));
>> 414 Boolean value = cgroupV1.isMemoryOOMKillEnabled();
>> 415 ostream.println(formatBoolean(value, INDENT + "Out Of Memory Killer Enabled: "));
>> 416 value = cgroupV1.isCpuSetMemoryPressureEnabled();
>> 417 ostream.println(formatBoolean(value, INDENT + "CPUSet Memory Pressure Enabled: "));
>> 418 }
>>
>> MetricsCgroupV1 is linux-only. It will fail the compilation when
>> building on non-linux. One option is to move this code to
>> src/java.base/linux/share/sun/launcher/CgroupMetrics.java
>>
>> Are they continued to be interesting metrics to be output from
>> -XshowSetting? I wonder if they can simply be dropped from the output.
>> Bob will have an opinion.
>
> I've removed those extra cgroup v1 specific metrics printed via
> -XshowSettings:system. Not sure what to do with MetricsCgroupV1. It's
> only used in tests in webrev 10. On the other hand the idea would be
> for consumers to downcast it to MetricsCgroupV1 if they needed those
> extra metrics.
>
> Thanks,
> Severin
>
More information about the serviceability-dev
mailing list