[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization
Ioi Lam
ioi.lam at oracle.com
Mon Feb 14 06:07:16 UTC 2022
On 2/8/2022 3:32 AM, Severin Gehwolf wrote:
> On Mon, 2022-02-07 at 22:29 -0800, Ioi Lam wrote:
>> On 2022/02/07 10:36, Severin Gehwolf wrote:
>>> On Sun, 2022-02-06 at 20:16 -0800, Ioi Lam wrote:
>>>> Case (4) is the cause of the bug in JDK-8279484
>>>>
>>>> Kubernetes sets the cpu.cfs_quota_us to 0 (no limit) and cpu.shares to 2.
>>>> This means:
>>>>
>>>> - This container is guaranteed a minimum amount of CPU resources
>>>> - If no other containers are executing, this container can use as
>>>> much CPU as available on the host
>>>> - If other containers are executing, the fraction of host CPU
>>>> available to this container is (2 / (sum of cpu.shares of all
>>>> active containers))
>>>>
>>>>
>>>> The fundamental problem with the current JVM implementation is that it
>>>> treats "CPU request" as a maximum value, the opposite of what Kubernetes
>>>> does. Because of this, in case (4), the JVM artificially limits itself
>>>> to a single CPU. This leads to CPU underutilization.
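To make case (4) concrete, here is the arithmetic as a runnable
sketch. The 1024 values for the neighboring containers are made up for
illustration (they are just Docker's default per-container shares, not
numbers from the bug report):

    public class SharesMath {
        public static void main(String[] args) {
            long myShares = 2;               // what Kubernetes assigned
            long[] neighbors = {1024, 1024}; // hypothetical busy containers
            long total = myShares;
            for (long s : neighbors) total += s;
            // Guaranteed fraction of host CPU while everyone is busy:
            System.out.printf("guaranteed fraction: %.4f%n",
                              (double) myShares / total); // ~0.0010
            // When the neighbors go idle, CFS lets this container use
            // whatever CPU is free, i.e. potentially 100%.
        }
    }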
>>> I agree with your analysis. The key point is that in such a setup
>>> Kubernetes sets the CPU shares value to 2. It is, though, a very
>>> specific case.
>>>
>>> In contrast to Kubernetes, the JVM doesn't have insight into what other
>>> containers are doing (or how they are configured). It would, perhaps,
>>> be good to know what Kubernetes does for containers when the
>>> environment (i.e. other containers) changes. Do they get restarted?
>>> Restarted with different values for cpu shares?
>> My understanding is that Kubernetes will try to do load balancing and
>> may migrate the containers. According to this:
>>
>> https://stackoverflow.com/questions/64891872/kubernetes-dynamic-configurationn-of-cpu-resource-limit
>>
>> If you change the CPU limits, a currently running container will be shut
>> down and restarted (using the new limit), and may be relocated to a
>> different host if necessary.
>>
>> I think this means that a JVM process doesn't need to worry about the
>> CPU limit changing during its lifetime :-)
>>> Either way, what are our options to fix this? Does it need fixing?
>>>
>>> * Should we stop taking cpu shares into account as a means to limit
>>> CPU? It would be a significant change to how previous JDKs worked.
>>> Maybe that wouldn't be such a bad idea :)
>> I think we should get rid of it. This feature was designed to work
>> with Kubernetes, but it has no effect in most cases. The only time it
>> takes effect (when no resource limits are set), it does the opposite
>> of what the user expects.
> I tend to agree. We should start with a CSR review of this, though, as
> it would be a behavioural change as compared to previous versions of
> the JDK.
Hi Severin,
Sorry for the delay. I've created a CSR. Could you take a look?
https://bugs.openjdk.java.net/browse/JDK-8281571
>
>> Also, the current implementation is tied to specific behaviors of
>> Kubernetes + Docker (the 1024 and 100 constants). This will cause
>> problems with other container/orchestration software that uses
>> different algorithms and constants.
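To illustrate what I meant: as I understand the current code,
cpu.shares is divided by the hard-coded Docker constant of 1024 and
rounded up to get a CPU count. Here is a paraphrase of that heuristic
(not the actual HotSpot source):

    public class SharesHeuristic {
        // Docker allocates 1024 shares per CPU by default; the JVM
        // hard-codes the same constant.
        static final long PER_CPU_SHARES = 1024;

        static int cpuCountFromShares(long shares) {
            long count = (shares + PER_CPU_SHARES - 1) / PER_CPU_SHARES;
            return (int) Math.max(count, 1);
        }

        public static void main(String[] args) {
            System.out.println(cpuCountFromShares(2));    // 1 -- case (4)
            System.out.println(cpuCountFromShares(2048)); // 2
        }
    }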
> There are other container orchestration frameworks, like Mesos, which
> behave in a similar way (the 1024 constant is used there as well).
> The good news is that Mesos seems to have moved to a hard-limit
> default. See:
>
> https://mesos.apache.org/documentation/latest/quota/#deprecated-quota-guarantees
>
>>> * How likely is CPU underutilization to happen in practice? If the
>>> container is not the only container on the node, then according to
>>> your formula it'll get one CPU or less anyway. Underutilization
>>> would, thus, only happen on an idle node with no other containers
>>> running. That would suggest doing nothing and letting the user
>>> override it as they see fit.
>> I think underutilization happens when the containers have a bursty
>> usage pattern. If other containers do not fully utilize their CPU
>> quotas, we should distribute the unused CPUs to the busy containers.
> Right, but this isn't really something the JVM process should care
> about. It's really a core feature of the orchestration framework to
> do that. All we could do is not limit CPU in those cases. On the
> other hand, there is the risk of resource starvation too. Consider a
> node with many cores, say 50, and a very small cpu shares setting via
> container limits. The experience of running a JVM application in such
> a setup would be very mediocre, as the JVM thinks it can use 50 cores
> (100% of the time), yet it would only get that when the rest of the
> containers/universe is idle.
I think we have a general problem that's not specific to containers. If
we are running 50 active Java processes on bare-metal Linux, then each
of them would by default use a 50-thread ForkJoinPool. If each process
is given an equal amount of CPU resources, it would make sense for each
of them to have a single-thread FJP, so we could avoid all the thread
context switching.
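If we wanted each process to behave that way today, the pool size can
already be pinned by hand. A minimal sketch using only the standard
API (nothing here is container-aware):

    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.IntStream;

    public class SingleThreadFjp {
        public static void main(String[] args) {
            // The common pool can be capped before first use with
            //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=1
            // or a dedicated single-thread pool can be built directly:
            ForkJoinPool pool = new ForkJoinPool(1);
            int sum = pool.submit(
                    () -> IntStream.rangeClosed(1, 100).sum()).join();
            System.out.println(sum); // 5050
            pool.shutdown();
        }
    }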
Or, maybe the Linux kernel is already good enough? If each process is
bound to a single physical CPU, context switching between the threads of
the same process should be pretty lightweight. It would be worthwhile
writing a test case ....
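Something along these lines could serve as a starting point; run it
once unpinned and once under "taskset -c 0 java ...", and compare the
wall-clock times (the thread count and busy-loop length below are
arbitrary choices, not anything measured):

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ContextSwitchTest {
        public static void main(String[] args) throws Exception {
            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            CountDownLatch done = new CountDownLatch(threads);
            long start = System.nanoTime();
            for (int i = 0; i < threads; i++) {
                pool.execute(() -> {
                    long x = 0;
                    for (long j = 0; j < 200_000_000L; j++) x += j; // busywork
                    if (x == 42) System.out.println("unreachable"); // keep x live
                    done.countDown();
                });
            }
            done.await();
            System.out.printf("%d threads: %d ms%n", threads,
                              (System.nanoTime() - start) / 1_000_000);
            pool.shutdown();
        }
    }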
Thanks
- Ioi
>
> Thanks,
> Severin
>