[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization

Ioi Lam ioi.lam at oracle.com
Mon Feb 7 04:16:30 UTC 2022


On 2/3/2022 3:29 AM, Severin Gehwolf wrote:
> Hi Ioi,
>
> On Wed, 2022-02-02 at 23:30 -0800, Ioi Lam wrote:
>> Please see the bug report [1] for detailed description and test cases.
>>
>> I'd like to have some discussion before we can decide what to do.
>>
>> I discovered this issue when analyzing JDK-8279484 [2]. Under Kubernetes
>> (minikube), Runtime.availableProcessors() returns 1, despite the fact
>> that the machine has 32 CPUs, the Kubernetes node has a single
>> deployment, and no CPU limits were set.
>  From looking at the bug it would be good to know why a cpu.weight value
> of 1 is being observed. The default is 100. I.e. if it is really unset:
>
> $ sudo docker run --rm -v $(pwd)/jdk17:/opt/jdk:z fedora:35 /opt/jdk/bin/java -Xlog:os+container=trace --version
> [0.000s][trace][os,container] OSContainer::init: Initializing Container Support
> [0.001s][debug][os,container] Detected cgroups v2 unified hierarchy
> [0.001s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
> [0.001s][trace][os,container] Raw value for memory limit is: max
> [0.001s][trace][os,container] Memory Limit is: Unlimited
> [0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
> [0.001s][trace][os,container] Raw value for CPU quota is: max
> [0.001s][trace][os,container] CPU Quota is: -1
> [0.001s][trace][os,container] Path to /cpu.max is /sys/fs/cgroup//cpu.max
> [0.001s][trace][os,container] CPU Period is: 100000
> [0.001s][trace][os,container] Path to /cpu.weight is /sys/fs/cgroup//cpu.weight
> [0.001s][trace][os,container] Raw value for CPU shares is: 100
> [0.001s][debug][os,container] CPU Shares is: -1
> [0.001s][trace][os,container] OSContainer::active_processor_count: 4
> [0.001s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
> [0.001s][debug][os,container] container memory limit unlimited: -1, using host value
> [0.001s][debug][os,container] container memory limit unlimited: -1, using host value
> [0.002s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
> [0.007s][debug][os,container] container memory limit unlimited: -1, using host value
> [0.014s][trace][os,container] CgroupSubsystem::active_processor_count (cached): 4
> [0.022s][trace][os,container] Path to /memory.max is /sys/fs/cgroup//memory.max
> [0.022s][trace][os,container] Raw value for memory limit is: max
> [0.022s][trace][os,container] Memory Limit is: Unlimited
> [0.022s][debug][os,container] container memory limit unlimited: -1, using host value
> openjdk 17.0.2-internal 2022-01-18
> OpenJDK Runtime Environment (build 17.0.2-internal+0-adhoc.sgehwolf.jdk17u)
> OpenJDK 64-Bit Server VM (build 17.0.2-internal+0-adhoc.sgehwolf.jdk17u, mixed mode, sharing)

In JDK-8279484, the JVM is launched by Kubernetes, which manages CPU 
resources with the concept of "request" and "limit".

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/


Quote: "Containers cannot use more CPU than the configured limit.
         Provided the system has CPU time free, a container
         is guaranteed to be allocated as much CPU as it requests."


So "CPU request" is a guaranteed minimum. For example, if you have a 
container that requests 6 CPUs, but all the hosts in your cluster have 
no more than 4 CPUs each, then this container will never be deployed by 
Kubernetes, because the minimum of 6 CPUs cannot be guaranteed.


Consider the following 4 cases:

(1) You specify both "cpu request" and "cpu limit"

(2) You specify only "cpu limit" -> Kubernetes will set the
     "cpu request" to be the same as the limit.

(3) If you specify only "cpu request", Kubernetes will set
     the "cpu limit" to a default value that's not smaller
     than the request.

(4) Neither "cpu request" nor "cpu limit" is set

(For details about the defaults, see 
https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/cpu-default-namespace/ 
)


In the first 3 cases, the JVM (on cgroup v1) will see that both 
cpu.cfs_quota_us and cpu.shares are set. The cpu.shares value will be 
ignored (due to the PreferContainerQuotaForCPUCount flag; see JDK-8197867).

Case (4) is the cause of the bug in JDK-8279484.

Kubernetes sets cpu.cfs_quota_us to 0 (no limit) and cpu.shares to 2. 
This means:

- This container is guaranteed a minimum amount of CPU resources
- If no other containers are executing, this container can use as
   much CPU as available on the host
- If other containers are executing, the amount of CPU available
   to this container is (2 / (sum of cpu.shares of all active
   containers))
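
The proportional split described above can be sketched as follows. This 
is a simplified illustration with hypothetical numbers; the real CFS 
scheduler divides time per-CPU and per-period, and only among containers 
that are actually runnable:

```java
// Sketch: proportional CPU allocation by cpu.shares under contention.
// Hypothetical numbers; the kernel scheduler is far more dynamic.
public class SharesSplit {
    // Fraction of host CPU a container receives when every listed
    // container is busy: myShares / (sum of all active shares).
    static double cpuFraction(long myShares, long[] allActiveShares) {
        long total = 0;
        for (long s : allActiveShares) total += s;
        return (double) myShares / total;
    }

    public static void main(String[] args) {
        // A pod with no request/limit gets cpu.shares = 2.
        // Alone on the host it can still use every CPU...
        System.out.println(cpuFraction(2, new long[] {2}));        // 1.0
        // ...but next to a pod that requested 1 CPU (shares = 1024),
        // it is only guaranteed 2/1026 of the host under contention.
        System.out.println(cpuFraction(2, new long[] {2, 1024}));
    }
}
```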


The fundamental problem with the current JVM implementation is that it 
treats "CPU request" as a maximum value, the opposite of what Kubernetes 
does. Because of this, in case (4), the JVM artificially limits itself 
to a single CPU. This leads to CPU underutilization.
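
For reference, the mapping the JVM applies (per JDK-8197867) can be 
sketched as below. This is a simplified reconstruction of the logic in 
CgroupSubsystem::active_processor_count, not the exact HotSpot code:

```java
// Simplified reconstruction of the JVM's cpu.shares -> CPU-count
// mapping (not the exact HotSpot implementation).
public class SharesToCpus {
    static final int PER_CPU_SHARES = 1024; // "1024 shares per CPU" convention

    static int cpuCountFromShares(int shares, int hostCpus) {
        // shares == 1024 is treated as "unset", i.e. no limit from shares.
        if (shares == PER_CPU_SHARES || shares <= 0) return hostCpus;
        int count = (int) Math.ceil((double) shares / PER_CPU_SHARES);
        return Math.min(count, hostCpus);
    }

    public static void main(String[] args) {
        int host = 32;
        System.out.println(cpuCountFromShares(1023, host)); // 1
        System.out.println(cpuCountFromShares(1024, host)); // 32 (no limit)
        System.out.println(cpuCountFromShares(2048, host)); // 2
        System.out.println(cpuCountFromShares(2, host));    // 1 (Kubernetes case 4)
    }
}
```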



>> Specifically, I want to understand why the JDK is using
>> CgroupSubsystem::cpu_shares() to limit the number of CPUs used by the
>> Java process.
> TLDR: Kubernetes and/or other container orchestration frameworks? That
> was back in the day of cgroups v1, though.
>
>> In cgroup, there are other ways that are designed specifically for
>> limiting the number of CPUs, i.e., CgroupSubsystem::cpu_quota(). Why is
>> using cpu_quota() alone not enough? Why did we choose the current
>> approach of considering both cpu_quota() and cpu_shares()?
> Kubernetes has a concept of "cpu requests" and "cpu limit". It maps (or
> mapped?) those values to cpu shares and cpu quota in cgroups.
>
>> My guess is that sometimes people don't limit the actual number of CPUs
>> per container, but instead use CPU Shares to set the relative scheduling
>> priority between containers.
>>
>> I.e., they run "docker run --cpu-shares=1234" without using the "--cpus"
>> flag.
>>
>> If this is indeed the reason, I can understand the (good) intention, but
>> the solution seems awfully insufficient.
>>
>> CPU Shares is a *relative* number. How much CPU is allocated to you
>> depends on
>>
>> - how many other processes are actively running
>> - what their CPU Shares are
>>
>> The above information can change dynamically, as other processes may be
>> added or removed, and they can change between active and idle states.
>>
>> However, the JVM treats CPU Shares as an *absolute/static* number, and
>> sets the CPU quota of the current process using this very simplistic
>> formula.
>>
>> Value of /sys/fs/cgroup/cpu.shares -> cpu quota:
>>
>>       1023 -> 1 CPU
>>       1024 -> no limit (huh??)
>>       2048 -> 2 CPUs
>>       4096 -> 4 CPUs
>>
>> This seems just wrong to me. There's no way you can get a "correct"
>> result without knowing anything about other processes that are running
>> at the same time.
>>
>> The net effect is that when Java is running in a container, more likely
>> than not, the JVM will limit itself to a single CPU. This seems really
>> inefficient to me.
> I believe the point is that popular container orchestration frameworks
> use the cpu requests feature to map to cpu.shares. A similar question
> regarding this was asked by myself a while ago. See JDK-8216366.
>
> Here is what Bob Vandette had to say at the time:
> http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036093.html

To quote Bob's reply from the above e-mail:

     Although the value for cpu-shares can be set to
     any of the values that you mention, we decided to
     follow the convention set by Kubernetes and other container
     orchestration products that use 1024 as the unit for
     cpu shares.  Ignoring the cpu shares in this case is
     not what users of this popular technology
     want.

https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu

     • The spec.containers[].resources.requests.cpu is converted to
       its core value, which is potentially fractional, and multiplied
       by 1024. The greater of this number or 2 is used as the value
       of the --cpu-shares flag in the docker run command.
     • The spec.containers[].resources.limits.cpu is converted
       to its millicore value and multiplied by 100. The resulting
       value is the total amount of CPU time that a container can use
       every 100ms. A container cannot use more than its share of
       CPU time during this interval.
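
The two quoted rules can be sketched numerically as follows. This is a 
simplified reading of the documented mapping, not the actual kubelet code:

```java
// Sketch of the quoted Kubernetes -> docker CPU mappings
// (simplified reading of the docs; not kubelet source).
public class K8sCpuMapping {
    // requests.cpu (in cores) -> docker --cpu-shares,
    // with a floor of 2 as the quoted rule states.
    static long cpuShares(double requestCores) {
        return Math.max(Math.round(requestCores * 1024), 2);
    }

    // limits.cpu (in millicores) -> CFS quota in microseconds
    // per 100ms period.
    static long cfsQuotaUs(long limitMillicores) {
        return limitMillicores * 100;
    }

    public static void main(String[] args) {
        System.out.println(cpuShares(1.0));  // 1024
        System.out.println(cpuShares(0.5));  // 512
        System.out.println(cpuShares(0.0));  // 2 (the floor case (4) hits)
        System.out.println(cfsQuotaUs(500)); // 50000 -> half a CPU per 100ms
    }
}
```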

As I mentioned above, Bob's conclusion that cpu.shares should be used
as an upper limit was probably based on a misunderstanding of what
resources.requests.cpu means in Kubernetes.

With resources.requests.cpu = 1.0, docker runs with --cpu-shares=1024.

This means "I need at least 1 CPU to execute".

However, the JVM incorrectly treats this as "I promise I will not use
more than 1 CPU".


Thanks
- Ioi



> Thanks,
> Severin
>
>> What should we do?
>>
>> Thanks
>> - Ioi
>>
>> [1]https://bugs.openjdk.java.net/browse/JDK-8281181
>> [2]https://bugs.openjdk.java.net/browse/JDK-8279484
>>

