[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization
Ioi Lam
ioi.lam at oracle.com
Mon Feb 7 04:16:30 UTC 2022
On 2/3/2022 3:29 AM, Severin Gehwolf wrote:
> Hi Ioi,
>
> On Wed, 2022-02-02 at 23:30 -0800, Ioi Lam wrote:
>> Please see the bug report [1] for detailed description and test cases.
>>
>> I'd like to have some discussion before we can decide what to do.
>>
>> I discovered this issue when analyzing JDK-8279484 [2]. Under Kubernetes
>> (minikube), Runtime.availableProcessors() returns 1, despite the fact
>> that the machine has 32 CPUs, the Kubernetes node has a single
>> deployment, and no CPU limits were set.
> From looking at the bug it would be good to know why a cpu.weight value
> of 1 is being observed. The default is 100. I.e. if it is really unset:
>
> $ sudo docker run --rm -v $(pwd)/jdk17:/opt/jdk:z fedora:35 /opt/jdk/bin/java -Xlog:os+container=trace --version
> [0.000s][trace][os,container] OSContainer::init: Initializing Container Support
> [0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
> [0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
> [0.001s][trace][os,container] Raw value for memory limit is: max
> [0.001s][trace][os,container] Memory Limit is: Unlimited
> [0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
> [0.001s][trace][os,container] Raw value for CPU quota is: max
> [0.001s][trace][os,container] CPU Quota is: -1
> [0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
> [0.001s][trace][os,container] CPU Period is: 100000
> [0.001s][trace][os,container] Path to /cpu.weight is /sys/fs/cgroup//cpu.weight
> [0.001s][trace][os,container] Raw value for CPU shares is: 100
> [0.001s][debug][os,container] CPU Shares is: -1
> [0.001s][trace][os,container] OSContainer::active_processor_count: 4
> [0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
> [0.001s][debug][os,container] container memory limit unlimited: -1, using host value
> [0.001s][debug][os,container] container memory limit unlimited: -1, using host value
> [0.002s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
> [0.007s][debug][os,container] container memory limit unlimited: -1, using host value
> [0.014s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
> [0.022s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
> [0.022s][trace][os,container] Raw value for memory limit is: max
> [0.022s][trace][os,container] Memory Limit is: Unlimited
> [0.022s][debug][os,container] container memory limit unlimited: -1, using host value
> openjdk 17.0.2-internal 2022-01-18
> OpenJDK Runtime Environment (build 17.0.2-internal+0-adhoc.sgehwolf.jdk17u)
> OpenJDK 64-Bit Server VM (build 17.0.2-internal+0-adhoc.sgehwolf.jdk17u, mixed mode, sharing)
In JDK-8279484, the JVM is launched by Kubernetes, which manages CPU
resources with the concept of "request" and "limit".
https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
Quote: "Containers cannot use more CPU than the configured limit.
Provided the system has CPU time free, a container
is guaranteed to be allocated as much CPU as it requests."
So "CPU request" is a guaranteed minimum. For example, if you have a
container that requests 6 CPUs, but all the hosts in your cluster have
no more than 4 CPUs each, then this container will never be deployed by
Kubernetes, because the minimum of 6 CPUs cannot be guaranteed.
Consider the following 4 cases:

(1) You specify both "cpu request" and "cpu limit".
(2) You specify only "cpu limit" -> Kubernetes sets the "cpu request"
    to the same value as the limit.
(3) You specify only "cpu request" -> Kubernetes sets the "cpu limit"
    to a default value that is not smaller than the request.
(4) You specify neither "cpu request" nor "cpu limit".

(For details about the defaults, see
https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-default-namespace/)
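For concreteness, here is a rough sketch, in Java, of the cgroup v1
values that the documented Kubernetes-to-docker conversion produces
(the class and helper names are mine, purely for illustration; the
conversion rules themselves are quoted from the Kubernetes docs
further below):

    // Illustrative only: what Kubernetes' documented conversion rules
    // produce for docker/cgroup v1. "cores" may be fractional, e.g.
    // 0.5 for a 500m request; pass a negative value for "not set".
    public class K8sCpuMapping {
        // requests.cpu -> docker --cpu-shares: cores * 1024, but never
        // below docker's minimum of 2 (this is why case (4) ends up
        // with cpu.shares == 2)
        static long cpuShares(double requestCores) {
            if (requestCores < 0) requestCores = 0;  // unset
            return Math.max(2, Math.round(requestCores * 1024));
        }

        // limits.cpu -> cpu.cfs_quota_us: millicores * 100, i.e. the
        // CPU time in microseconds usable per 100ms cfs period. Here
        // -1 stands for "unlimited", matching the "CPU Quota is: -1"
        // line in the log above.
        static long cfsQuotaUs(double limitCores) {
            if (limitCores < 0) return -1;           // unset: no limit
            return Math.round(limitCores * 100_000);
        }

        public static void main(String[] args) {
            // case (4): shares=2, quota=-1
            System.out.println(cpuShares(-1) + " " + cfsQuotaUs(-1));
            // case (1), request=1.0 limit=2.0: shares=1024, quota=200000
            System.out.println(cpuShares(1.0) + " " + cfsQuotaUs(2.0));
        }
    }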
In the first 3 cases, the JVM (on cgroup v1) will see that both
cpu.cfs_quota_us and cpu.shares are set. The cpu.shares will be ignored
(due to the PreferContainerQuotaForCPUCount flag; see JDK-8197867).

Case (4) is the cause of the bug in JDK-8279484:
Kubernetes sets cpu.cfs_quota_us to 0 (no limit) and cpu.shares to 2.
This means:
- This container is guaranteed a minimum amount of CPU resources
- If no other containers are executing, this container can use as
much CPU as available on the host
- If other containers are executing, the amount of CPU available
to this container is (2 / (sum of cpu.shares of all active
containers))
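For example (with made-up numbers): if this container, with its
cpu.shares value of 2, competes with a single busy neighbor that has
the default 1024 shares, it is entitled to only 2 / (2 + 1024), i.e.
roughly 0.2% of the host's CPU time while both are runnable.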
The fundamental problem with the current JVM implementation is that it
treats "CPU request" as a maximum value, the opposite of what Kubernetes
does. Because of this, in case (4), the JVM artificially limits itself
to a single CPU. This leads to CPU underutilization.
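To make the failure mode concrete, here is a Java paraphrase of how the
JVM currently derives the processor count from quota and shares (a
sketch only; the real implementation is C++, in HotSpot's cgroup
support, around CgroupSubsystem::active_processor_count, and the names
and structure here are paraphrased, not copied):

    // Sketch of the JVM's current CPU-count logic.
    static int activeProcessorCount(int hostCpus, long quotaUs,
                                    long periodUs, long shares) {
        final int PER_CPU_SHARES = 1024;
        int quotaCount = 0;
        int shareCount = 0;
        if (quotaUs > -1 && periodUs > 0) {
            quotaCount = (int) Math.ceil((double) quotaUs / periodUs);
        }
        // shares == 1024 is treated as "unset", hence the odd
        // "1024 -> no limit" row in the table quoted further below
        if (shares > -1 && shares != PER_CPU_SHARES) {
            shareCount = (int) Math.ceil((double) shares / PER_CPU_SHARES);
        }
        int limitCount = hostCpus;
        if (quotaCount != 0 && shareCount != 0) {
            // PreferContainerQuotaForCPUCount is true by default
            // (JDK-8197867), so the quota wins when both are set
            limitCount = quotaCount;
        } else if (quotaCount != 0) {
            limitCount = quotaCount;
        } else if (shareCount != 0) {
            limitCount = shareCount;   // <-- case (4) lands here
        }
        return Math.min(hostCpus, limitCount);
    }

With the values Kubernetes produces in case (4), i.e. quota unset and
shares = 2, this returns ceil(2/1024) = 1 on any host, which is exactly
the underutilization described above.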
>> Specifically, I want to understand why the JDK is using
>> CgroupSubsystem::cpu_shares() to limit the number of CPUs used by the
>> Java process.
> TLDR: Kubernetes and/or other container orchestration frameworks? That
> was back in the day of cgroups v1, though.
>
>> In cgroup, there are other ways that are designed specifically for
>> limiting the number of CPUs, i.e., CgroupSubsystem::cpu_quota(). Why is
>> using cpu_quota() alone not enough? Why did we choose the current
>> approach of considering both cpu_quota() and cpu_shares()?
> Kubernetes has a concept of "cpu requests" and "cpu limit". It maps (or
> mapped?) those values to cpu shares and cpu quota in cgroups.
>
>> My guess is that sometimes people don't limit the actual number of CPUs
>> per container, but instead use CPU Shares to set the relative scheduling
>> priority between containers.
>>
>> I.e., they run "docker run --cpu-shares=1234" without using the "--cpus"
>> flag.
>>
>> If this is indeed the reason, I can understand the (good) intention, but
>> the solution seems awfully insufficient.
>>
>> CPU Shares is a *relative* number. How much CPU is allocated to you
>> depends on
>>
>> - how many other processes are actively running
>> - what their CPU Shares are
>>
>> The above information can change dynamically, as other processes may be
>> added or removed, and they can change between active and idle states.
>>
>> However, the JVM treats CPU Shares as an *absolute/static* number, and
>> sets the CPU quota of the current process using this very simplistic
>> formula.
>>
>> Value of /sys/fs/cgroup/cpu.shares -> cpu quota:
>>
>> 1023 -> 1 CPU
>> 1024 -> no limit (huh??)
>> 2048 -> 2 CPUs
>> 4096 -> 4 CPUs
>>
>> This seems just wrong to me. There's no way you can get a "correct"
>> result without knowing anything about other processes that are running
>> at the same time.
>>
>> The net effect is that when Java is running under a container, more
>> likely than not, the JVM will limit itself to a single CPU. This
>> seems really inefficient to me.
> I believe the point is that popular container orchestration frameworks
> use the cpu requests feature to map to cpu.shares. A similar question
> regarding this was asked by myself a while ago. See JDK-8216366.
>
> Here is what Bob Vandette had to say at the time:
> http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036093.html
To quote Bob's reply from the above e-mail:

    Although the value for cpu-shares can be set to any of the values
    that you mention, we decided to follow the convention set by
    Kubernetes and other container orchestration products that use
    1024 as the unit for cpu shares. Ignoring the cpu shares in this
    case is not what users of this popular technology want.

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu

    • The spec.containers[].resources.requests.cpu is converted to its
      core value, which is potentially fractional, and multiplied by
      1024. The greater of this number or 2 is used as the value of
      the --cpu-shares flag in the docker run command.

    • The spec.containers[].resources.limits.cpu is converted to its
      millicore value and multiplied by 100. The resulting value is
      the total amount of CPU time that a container can use every
      100ms. A container cannot use more than its share of CPU time
      during this interval.
As I mentioned above, Bob's conclusion that cpu.shares should be used
as an upper limit was probably based on a misunderstanding of what
resources.requests.cpu means in Kubernetes.

With resources.requests.cpu = 1.0, docker runs with --cpu-shares=1024.
This means "I need at least 1 CPU to execute". However, the JVM
incorrectly treats this as "I promise I will not use more than 1 CPU".
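An easy way to observe the effect (a hypothetical reproducer; the class
name is mine) is to print what the JVM thinks it may use:

    // ShowCpus.java: prints the JVM's idea of the usable CPU count
    public class ShowCpus {
        public static void main(String[] args) {
            System.out.println("availableProcessors = "
                    + Runtime.getRuntime().availableProcessors());
        }
    }

On a 32-CPU host, running this under "docker run --cpu-shares=2 ..."
prints availableProcessors = 1, even though the container is allowed to
consume all 32 CPUs whenever the host is otherwise idle.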
Thanks
- Ioi
> Thanks,
> Severin
>
>> What should we do?
>>
>> Thanks
>> - Ioi
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8281181
>> [2] https://bugs.openjdk.java.net/browse/JDK-8279484
>>