RFR: 8265836: OperatingSystemImpl.getCpuLoad() returns incorrect CPU load inside a container [v2]
Severin Gehwolf
sgehwolf at openjdk.java.net
Fri May 21 16:16:19 UTC 2021
On Tue, 11 May 2021 15:48:43 GMT, Hao Tang <github.com+7947546+tanghaoth90 at openjdk.org> wrote:
>>> Thanks for linking that. It sounds reasonable to me to prefer `quota` in that case.
>>
>> Yes, flag `PreferContainerQuotaForCPUCount` is [true by default](https://github.com/openjdk/jdk/blob/739769c8/src/hotspot/os/linux/globals_linux.hpp#L62). Therefore, [my current implementation](https://github.com/openjdk/jdk/pull/3656/commits/b052b624c84) might be a minimal implementation.
>>
>> We can also cover the case where `PreferContainerQuotaForCPUCount` is false. There are two different ways.
>> 1. To access the value of `PreferContainerQuotaForCPUCount`, so that we can decide if we should use `quota` or (`quota` & `share`);
>> 2. Reuse `CgroupSubsystem::active_processor_count`. However, the function returns an integer. It is more reasonable to use a floating number.
>>
>> Looking forward to your suggestion.
>
>> @tanghaoth90 My local testing suggests that your fix addresses the issue of CPU quotas set via `--cpu-quota/--cpu-period`. When using `--cpu-shares` the CPU load calculation is wrong since it will (wrongly) report host values. Lets look at them individually, fix the quota and shares case individually (i.e. when not both are set). Once that's done, quota will be preferred in the OperatingSystemMXBean impl, which is reasonable. I don't think we need to account for the shares-preferred-over-quota at this point since that changed in HotSpot in JDK 11 time-frame (JDK-8197589) and OperatingSystemMXBean has only been made container aware in JDK 14 (yes, it got backported, but still).
>
> @jerboaa I have updated my fix by inserting a new branch for `--cpu-shares`. By setting up containers with different restrictions based on your suggestion, I noticed two problems for the CPU load calculation. Take [TestLoad.java](https://bugs.openjdk.java.net/secure/attachment/94530/TestLoad.java) as the example.
>
> 1. `--cpu-quota=700000 --cpu-period=100000` as the restriction: the result is getting close to `6/7` slowly as time goes by, which indicates that the result of
>
> long usageNanos = containerMetrics.getCpuUsage();
>
> is an accumulated CPU usage rather than a real-time CPU usage.
> According to the javadoc, `getCpuLoad()` _returns the "recent cpu usage"_. The current fix obviously does not address this issue.
>
> 2. `--cpu-shares=2048` as the restriction: the "cpu-share" branch only returns `-1` since `containerMetrics.getCpuNumPeriods()` returns `0` (`nr_periods` of `cpu.stat` stays `0` when only `--cpu-shares` is set). By now, I have no idea to compute the elapsed time or the total available CPU time with the help of the methods of `jdk.internal.platform.CgroupMetrics`.
>
> Any suggestion is appreciated. @jerboaa @argha-c
@tanghaoth90
As to your observation of 1). Yes, that's true and I'm not sure it's worth changing. We could change the javadoc.
As for 2) I've created a PR for your branch with a suggested implementation for cpu shares based accounting. It uses a similar heuristics than the getCpuLoad0() uses here:
https://github.com/openjdk/jdk/blob/master/src/jdk.management/linux/native/libmanagement_ext/UnixOperatingSystem.c#L278
That's the best I could think of. It's certainly more accurate than the current implementation.
-------------
PR: https://git.openjdk.java.net/jdk/pull/3656
More information about the serviceability-dev
mailing list