[Containers] Reasoning for cpu shares limits
Bob Vandette
bob.vandette at oracle.com
Mon Jan 7 15:24:18 UTC 2019
> On Jan 7, 2019, at 5:31 AM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>
> Hi Bob,
>
> Thanks for your response!
>
> On Fri, 2019-01-04 at 17:34 -0500, Bob Vandette wrote:
>> Hi Severin,
>>
>> There has been much debate on the best algorithm for selecting the number of CPUs that is
>> reported by the Java Runtime when running in containers.
>
> I can imagine. I'm wondering whether all aspects have been properly
> considered, though.
>
Given that there is no perfect answer here, I’ve tried to come up with a solution that at least
supports the most popular Cloud use cases. Cgroups pre-date Kubernetes, which used whatever
facilities were available to help with its host resource allocation.
I looked through the source for Mesos and found that it uses the same 1024-per-CPU share
convention. Oracle Cloud and AWS both use this convention. If it were not for this popular
convention, I would have had no choice but to ignore the share value since, as you state,
there is no way to know the relative value one container has over another. k8s, however, does
use cpu requests (cpu-shares) in order to ensure that a Pod does not exceed the resources available on a Node.
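Roughly, the convention amounts to the following (just a sketch to illustrate the arithmetic;
the helper names are mine, this is not code from Kubernetes or HotSpot):

public class CpuShareConvention {
    static final int PER_CPU_SHARES = 1024;

    // What an orchestrator does: a CPU request in cores becomes a --cpu-shares value.
    static long sharesForCpuRequest(double cores) {
        return Math.max(2, Math.round(cores * PER_CPU_SHARES));
    }

    // The inverse the container-aware JVM heuristic applies: shares back to a CPU count.
    static int cpusFromShares(long shares) {
        return (int) Math.ceil((double) shares / PER_CPU_SHARES);
    }

    public static void main(String[] args) {
        System.out.println(sharesForCpuRequest(0.5)); // 512
        System.out.println(cpusFromShares(512));      // 1
        System.out.println(cpusFromShares(2048));     // 2
    }
}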
>> Although the value for cpu-shares can be set to any of the values that you mention, we decided to
>> follow the convention set by Kubernetes and other container orchestration products that use 1024 as
>> the unit for cpu shares. Ignoring the cpu shares in this case is not what users of this popular technology
>> want.
>
> Why not? A '--cpu-shares=X' setting does not imply JVM internal CPU
> limits AFAIK. Consider 3 JVM containers running on a node/host vs. 2
> JVM containers running on a node/host with the *same* --cpu-shares
> setting.
If under k8s you specify a CPU request of 2 for each, the system designer intentionally
wanted to have an impact on the process running in the container. Doing nothing
would cause the VM to configure its thread pools to use more CPUs than was intended,
resulting in thrashing under a heavily loaded system.
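To make that concern concrete, here is a small, generic illustration (nothing k8s-specific,
and assuming default JDK behavior) of how the reported processor count feeds into library
thread pools:

import java.util.concurrent.ForkJoinPool;

public class PoolSizing {
    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        // The common ForkJoinPool defaults its parallelism to availableProcessors() - 1
        // (with a minimum of 1), so an inflated CPU count means an oversized pool.
        System.out.println("availableProcessors:     " + cpus);
        System.out.println("common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
    }
}

Run in a container whose CPU count is over-reported, the common pool (and similar internal
pools) ends up sized for CPUs the container will never actually get.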
>
> Effectively, after JDK-8197589, cpu shares value being ignored by the
> JVM is what's happening. That's what I'm seeing for JVM containers on
> k8s anyway.
cpu-shares are only used if there is no cpu-quota set. I have no way of knowing whether it
is common to have cpu requests without cpu limits, but it is possible.
Here’s more detail on what cpu requests and limits mean to k8s.
Pod scheduling is based on requests. A Pod is scheduled to run on a Node only if the Node has
enough CPU resources available to satisfy the Pod CPU request.
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/#specify-a-cpu-request-that-is-too-big-for-your-nodes
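The conversions described there boil down to roughly this (a sketch with made-up helper
names; the 100ms period matches the docker default):

public class CpuQuotaConvention {
    static final long PERIOD_US = 100_000; // 100ms CFS period

    // limits.cpu (in cores) -> --cpu-quota in microseconds per period
    static long quotaForCpuLimit(double cores) {
        return Math.round(cores * PERIOD_US);
    }

    // quota/period -> CPU count a container-aware runtime can derive
    static int cpusFromQuota(long quotaUs, long periodUs) {
        return (int) Math.ceil((double) quotaUs / periodUs);
    }

    public static void main(String[] args) {
        System.out.println(quotaForCpuLimit(1.5));            // 150000
        System.out.println(cpusFromQuota(150_000, 100_000));  // 2
    }
}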
>
>>
>> https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#meaning-of-cpu
>>
>> • The spec.containers[].resources.requests.cpu is converted to its core value, which is potentially fractional, and multiplied by 1024. The greater of this number or 2 is used as the value of the --cpu-shares flag in the docker run command.
>> • The spec.containers[].resources.limits.cpu is converted to its millicore value and multiplied by 100. The resulting value is the total amount of CPU time that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.
>>
>> There are a few options that can be used if our default behavior doesn’t work for you.
>>
>> 1. Use quotas in addition to or instead of shares.
>> 2. Specify -XX:ActiveProcessorCount=value
>
> OK. So it's modelled after how Kubernetes does things. What I'm
> questioning is whether the spec.containers[].resources.requests.cpu
> setting of Kubernetes should have any bearing on the number of CPUs the
> *JVM* thinks are available to it, though. It's still just a relative
> weight a JVM-based container would get. What if k8s decides to use a
> different magic number? Should this be hard-coded in the JVM? Should
> this be used in the JVM at all?
>
> Taking the Kubernetes case, it'll usually set CPU shares *and* CPU
> quota. The latter very likely being the higher value as k8s models
> spec.containers[].resources.requests.cpu as a sort of minimal CPU value
> and spec.containers[].resources.limits.cpu as a maximum, hard limit. In
> that respect, having CPU shares' value modelled by the k8s case *within
> the JVM* seems arbitrary as it won't be used anyway. Quotas take
> precedence. Perhaps that's why JDK-8197589 was done after JDK-8146115?
>
> I'd argue that:
>
> A) Modelling this after the k8s case and enforcing a CPU limit
> (within the JVM) based on a relative weight is still wrong. The
> common case for k8s is both settings, shares and quota, being
> present. After JDK-8197589, there is even a preference to use
> quota over CPU shares. I'd argue PreferContainerQuotaForCPUCount
> JVM switch wouldn't be needed if CPU shares wouldn't have any
> effect on the internal JVM settings to begin with.
> B) It breaks other frameworks which don't use this convention for no
> good reason. Cloudfoundry is a case in point.
> C) This needs to be at least documented in code as to why that decision
> has been made. Specifically "#define PER_CPU_SHARES 1024" in
> src/hotspot/os/linux/osContainer_linux.cpp.
I agree that PER_CPU_SHARES should have a comment documenting its
meaning and origin.
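For reference, here is how I read the current logic after JDK-8197589 (a rough Java model
for discussion, not the actual HotSpot code), which also shows where PER_CPU_SHARES comes
into play:

public class ContainerCpuHeuristic {
    static final int PER_CPU_SHARES = 1024; // the constant whose origin should be documented

    static int activeProcessorCount(int hostCpus, long quotaUs, long periodUs,
                                    long shares, boolean preferQuota) {
        int quotaCount = (quotaUs > 0 && periodUs > 0)
                ? (int) Math.ceil((double) quotaUs / periodUs) : 0;
        int shareCount = (shares > 0)
                ? (int) Math.ceil((double) shares / PER_CPU_SHARES) : 0;

        int limit = hostCpus;
        if (quotaCount != 0 && shareCount != 0) {
            // Both set (the common k8s case): quota wins when PreferContainerQuotaForCPUCount
            // is true (the default after JDK-8197589), otherwise the smaller of the two is used.
            limit = preferQuota ? quotaCount : Math.min(quotaCount, shareCount);
        } else if (quotaCount != 0) {
            limit = quotaCount;
        } else if (shareCount != 0) {
            limit = shareCount;   // shares only matter when no quota is set
        }
        return Math.min(hostCpus, limit);
    }

    public static void main(String[] args) {
        // e.g. a pod with cpu requests=2, limits=4 on a 16-CPU node:
        System.out.println(activeProcessorCount(16, 400_000, 100_000, 2048, true)); // 4
        // shares only, no quota (cpu requests without cpu limits):
        System.out.println(activeProcessorCount(16, -1, -1, 2048, true));           // 2
    }
}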
>
> As to the possible work-arounds:
>
> "Use quotas in addition to or instead of shares":
>
> I'd argue that's not an option for most (all?) use-cases. CPU quotas
> are stable, not relying on other containers running on a node/host. CPU
> shares, on the other hand, are just a relative weight and largely
> depend on the number of other containers running on the same node/host.
> That's something external to the JVM, so it can't possibly know which
> value it should use. CPU quotas, IMO, make sense to have a bearing on
> the JVM's internal settings as those settings are documented by the CFS
> bandwidth control doc (see examples):
> https://www.kernel.org/doc/Documentation/scheduler/sched-bwc.txt
>
> "ActiveProcessorCount":
>
> It's nice to have, but it needs user intervention. For one, needing to
> know that this switch exists. For two, coming up with a reasonable way
> to set this value. Anyway, it's a nice stop-gap solution if things
> don't work as intended. We should keep it.
>
> Is there any chance using CPU shares for internal JVM purposes could be
> reconsidered? The argument that k8s uses 1024 as a scale factor isn't
> very compelling. It'll still pass it to docker via --cpu-shares, which
> is a relative weight.
>
> I'd be happy to help and improve this situation. Thoughts?
I’d like to get some feedback from the docker, k8s and CloudFoundry
communities before changing this algorithm once again. Churning the algorithm is
almost as bad as the current situation, since developers may already be adapting
to the new behavior.
Bob.
>
> Thanks,
> Severin
>
>> Bob.
>>
>>> On Jan 4, 2019, at 1:09 PM, Severin Gehwolf <sgehwolf at redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> Having come across this cloud foundry issue[1], I wonder why the cgroup
>>> cpu shares' value is being used in the JVM as a heuristic for available
>>> processors.
>>>
>>> From the man page from docker-run:
>>>
>>> ---------------------------------------------------------
>>> --cpu-shares=0
>>> CPU shares (relative weight)
>>>
>>> By default, all containers get the same proportion of CPU cycles. This proportion can be modified by changing the container's CPU share weighting relative to the weighting of all other running
>>> containers.
>>>
>>> To modify the proportion from the default of 1024, use the --cpu-shares flag to set the weighting to 2 or higher.
>>>
>>> The proportion will only apply when CPU-intensive processes are running. When tasks in one container are idle, other containers can use the left-over CPU time. The actual amount of CPU time will
>>> vary depending on the number of containers running on the system.
>>>
>>> For example, consider three containers, one has a cpu-share of 1024 and two others have a cpu-share setting of 512. When processes in all three containers attempt to use 100% of CPU, the first
>>> container would receive 50% of the total CPU time. If you add a fourth container with a cpu-share of 1024, the first container only gets 33% of the CPU. The remaining containers receive 16.5%, 16.5%
>>> and 33% of the CPU.
>>>
>>> On a multi-core system, the shares of CPU time are distributed over all CPU cores. Even if a container is limited to less than 100% of CPU time, it can use 100% of each individual CPU core.
>>>
>>> For example, consider a system with more than three cores. If you start one container {C0} with -c=512 running one process, and another container {C1} with -c=1024 running two processes, this can
>>> result in the following division of CPU shares:
>>>
>>> PID container CPU CPU share
>>> 100 {C0} 0 100% of CPU0
>>> 101 {C1} 1 100% of CPU1
>>> 102 {C1} 2 100% of CPU2
>>>
>>> ---------------------------------------------------------
>>>
>>> So the cpu shares value (unlike --cpu-quota) is a relative weight.
>>>
>>> For example, those three cpu-shares settings are equivalent (C1-C4 are
>>> containers; '-c' is a short-cut for '--cpu-shares'):
>>>
>>> A[i]
>>> -------------
>>> C1 => -c=122
>>> C2 => -c=122
>>> C3 => -c=61
>>> C4 => -c=61
>>>
>>> B[ii]
>>> -------------
>>> C1 => -c=1026
>>> C2 => -c=1026
>>> C3 => -c=513
>>> C4 => -c=513
>>>
>>> C[iii]
>>> -------------
>>> C1 => -c=2048
>>> C2 => -c=2048
>>> C3 => -c=1024
>>> C4 => -c=1024
>>>
>>> For A the container CPU heuristics will determine for the JVM to use 1
>>> CPU for C1-C4. For B and C, the container CPU heuristics will determine
>>> for the JVM to use 2 CPUs for C1 and C2 and 1 CPU for C3 and C4 which
>>> seems rather inconsistent and arbitrary. The reason this is happening
>>> is that 1024 seems to have gotten a questionable meaning in [2]. I
>>> wonder why?
>>>
>>> The JVM cannot reasonably determine from the relative weight of --cpu-
>>> shares' value how many CPUs it should use. As it's a relative weight
>>> that's something for the container runtime to take into account. It
>>> appears to me that the container detection code should probably fall
>>> back to the host CPU value and only take CPU quotas into account.
>>>
>>> Am I missing something obvious here? All I could find was this in JDK-
>>> 8146115:
>>> """
>>> If cpu_shares has been setup for the container, the number_of_cpus()
>>> will be calculated based on cpu_shares()/1024. 1024 is the default and
>>> standard unit for calculating relative cpu
>>> """
>>>
>>> "1024 is the default and standard unit for calculating relative cpu"
>>> seems a wrong assumption to me. Thoughts?
>>>
>>> Thanks,
>>> Severin
>>>
>>> [1] https://github.com/cloudfoundry/java-buildpack/issues/650#issuecomment-441777166
>>> [2] http://hg.openjdk.java.net/jdk/jdk/rev/7f22774a5f42#l4.43
>>> [i]* http://cr.openjdk.java.net/~sgehwolf/container-resources-cpu/c122.out.log
>>> http://cr.openjdk.java.net/~sgehwolf/container-resources-cpu/c61.out.log
>>> [ii]* http://cr.openjdk.java.net/~sgehwolf/container-resources-cpu/c1026.out.log
>>> http://cr.openjdk.java.net/~sgehwolf/container-resources-cpu/c513.out.log
>>> [iii]* http://cr.openjdk.java.net/~sgehwolf/container-resources-cpu/c2048.out.log
>>> http://cr.openjdk.java.net/~sgehwolf/container-resources-cpu/c1024.out.log
>>>
>>> * Files produced with:
>>>
>>> $ for i in 1026 513 2048 1024 122 61; do sudo docker run -ti -c=$i --rm fedora28-jdks:v1 /jdk-head/bin/java -showversion -Xlog:os+container=trace RuntimeProc > container-resources-cpu/c${i}.out.log; done
>>> $ sudo docker run -ti --rm fedora28-jdks:v1 cat RuntimeProc.java
public class RuntimeProc {
    public static void main(String[] args) {
        int availProc = Runtime.getRuntime().availableProcessors();
        System.out.println(">>> Available processors: " + availProc + " <<<<");
    }
}
>>>
>>>
>>>
>>
>>
>