RFR: 8265836: OperatingSystemImpl.getCpuLoad() returns incorrect CPU load inside a container [v2]

Hao Tang github.com+7947546+tanghaoth90 at openjdk.java.net
Tue May 11 15:52:01 UTC 2021


On Thu, 6 May 2021 02:34:01 GMT, Hao Tang <github.com+7947546+tanghaoth90 at openjdk.org> wrote:

>> Thanks for linking that. It sounds reasonable to me to prefer `quota` in that case.
>
>> Thanks for linking that. It sounds reasonable to me to prefer `quota` in that case.
> 
> Yes, flag `PreferContainerQuotaForCPUCount` is [true by default](https://github.com/openjdk/jdk/blob/739769c8/src/hotspot/os/linux/globals_linux.hpp#L62). Therefore, [my current implementation](https://github.com/openjdk/jdk/pull/3656/commits/b052b624c84) might be a minimal implementation.
> 
> We can also cover the case where `PreferContainerQuotaForCPUCount` is false. There are two different ways.
> 1. To access the value of `PreferContainerQuotaForCPUCount`, so that we can decide if we should use `quota` or (`quota` &  `share`);
> 2. Reuse `CgroupSubsystem::active_processor_count`. However, the function returns an integer. It is more reasonable to use a floating number.
> 
> Looking forward to your suggestion.

> @tanghaoth90 My local testing suggests that your fix addresses the issue of CPU quotas set via `--cpu-quota/--cpu-period`. When using `--cpu-shares` the CPU load calculation is wrong since it will (wrongly) report host values. Lets look at them individually, fix the quota and shares case individually (i.e. when not both are set). Once that's done, quota will be preferred in the OperatingSystemMXBean impl, which is reasonable. I don't think we need to account for the shares-preferred-over-quota at this point since that changed in HotSpot in JDK 11 time-frame (JDK-8197589) and OperatingSystemMXBean has only been made container aware in JDK 14 (yes, it got backported, but still).

@jerboaa I have updated my fix by inserting a new branch for `--cpu-shares`. By setting up containers with different restrictions based on your suggestion, I noticed two problems for the CPU load calculation. Take [TestLoad.java](https://bugs.openjdk.java.net/secure/attachment/94530/TestLoad.java) as the example.

1. `--cpu-quota=700000 --cpu-period=100000` as the restriction: the result is getting close to `6/7` slowly as time goes by, which indicates that the result of 

long usageNanos = containerMetrics.getCpuUsage();

is an accumulated CPU usage rather than a real-time CPU usage.
According to the javadoc, `getCpuLoad()` _returns the "recent cpu usage"_. The current fix obviously does not address this issue.

2. `--cpu-shares=2048` as the restriction: the "cpu-share" branch only returns `-1` since `containerMetrics.getCpuNumPeriods()` returns `0` (`nr_periods` of `cpu.stat` stays `0` when only `--cpu-shares` is set). By now, I have no idea to compute the elapsed time or the total available CPU time with the help of the methods of `jdk.internal.platform.CgroupMetrics`.

Any suggestion is appreciated. @jerboaa @argha-c

-------------

PR: https://git.openjdk.java.net/jdk/pull/3656


More information about the serviceability-dev mailing list