[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization

David Holmes david.holmes at oracle.com
Mon Feb 14 07:02:17 UTC 2022


On 14/02/2022 4:07 pm, Ioi Lam wrote:
> On 2/8/2022 3:32 AM, Severin Gehwolf wrote:
>> On Mon, 2022-02-07 at 22:29 -0800, Ioi Lam wrote:
>>> On 2022/02/07 10:36, Severin Gehwolf wrote:
>>>> On Sun, 2022-02-06 at 20:16 -0800, Ioi Lam wrote:
>>>>> Case (4) is the cause for the bug in JDK-8279484
>>>>>
>>>>> Kubernetes sets cpu.cfs_quota_us to 0 (no limit) and cpu.shares to 2.
>>>>> This means:
>>>>>
>>>>> - This container is guaranteed a minimum amount of CPU resources
>>>>> - If no other containers are executing, this container can use as
>>>>>      much CPU as available on the host
>>>>> - If other containers are executing, the fraction of the host's CPU
>>>>>      available to this container is (2 / (sum of cpu.shares of all
>>>>>      active containers))
>>>>>
>>>>>
>>>>> The fundamental problem with the current JVM implementation is that it
>>>>> treats "CPU request" as a maximum value, the opposite of what 
>>>>> Kubernetes
>>>>> does. Because of this, in case (4), the JVM artificially limits itself
>>>>> to a single CPU. This leads to CPU underutilization.
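
To make the arithmetic concrete, here is a minimal, hypothetical sketch (not
the JDK's implementation; the class, constant and example numbers are made up
for illustration) contrasting the "shares as a CPU count" heuristic described
above with the "shares as a relative weight" semantics Kubernetes relies on:

    // Hypothetical sketch -- not JDK code.
    public class CpuSharesSketch {

        // Per-CPU shares constant assumed by the heuristic under discussion.
        static final int PER_CPU_SHARES = 1024;

        // Shares interpreted as a CPU *count*: ceil(shares / 1024).
        // With shares == 2 this yields 1, i.e. the JVM caps itself at one CPU.
        static int sharesAsCpuCount(int cpuShares) {
            return (int) Math.ceil((double) cpuShares / PER_CPU_SHARES);
        }

        // Shares interpreted as a relative *weight*: under contention the
        // container gets hostCpus * (its shares / sum of all runnable shares);
        // on an idle host it may use up to hostCpus.
        static double fairShareUnderContention(int cpuShares,
                                               int sumOfAllShares,
                                               int hostCpus) {
            return hostCpus * ((double) cpuShares / sumOfAllShares);
        }

        public static void main(String[] args) {
            int shares = 2;       // what Kubernetes writes for a tiny CPU request
            int hostCpus = 32;    // hypothetical host size
            System.out.println(sharesAsCpuCount(shares));                          // 1
            System.out.println(fairShareUnderContention(shares, 4096, hostCpus));  // ~0.016
            // On an idle host, Kubernetes would still let this container burst to
            // all 32 CPUs -- which is exactly where the 1-CPU cap underutilizes.
        }
    }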
>>>> I agree with your analysis. The key point is that in such a setup
>>>> Kubernetes sets the CPU shares value to 2. It is, though, a very
>>>> specific case.
>>>>
>>>> In contrast to Kubernetes the JVM doesn't have insight into what other
>>>> containers are doing (or how they are configured). It would, perhaps,
>>>> be good to know what Kubernetes does for containers when the
>>>> environment (i.e. other containers) changes. Do they get restarted?
>>>> Restarted with different values for cpu shares?
>>> My understanding is that Kubernetes will try to do load balancing and
>>> may migrate the containers. According to this:
>>>
>>> https://stackoverflow.com/questions/64891872/kubernetes-dynamic-configurationn-of-cpu-resource-limit 
>>>
>>>
>>> If you change the CPU limits, a currently running container will be shut
>>> down and restarted (using the new limit), and may be relocated to a
>>> different host if necessary.
>>>
>>> I think this means that a JVM process doesn't need to worry about the
>>> CPU limit changing during its lifetime :-)
>>>> Either way, what are our options to fix this? Does it need fixing?
>>>>
>>>>    * Should we no longer take cpu shares as a means to limit CPU into
>>>>      account? It would be a significant change to how previous JDKs
>>>>      worked. Maybe that wouldn't be such a bad idea :)
>>> I think we should get rid of it. This feature was designed to work with
>>> Kubernetes, but it has no effect in most cases. In the only case where it
>>> does take effect (when no resource limits are set), it does the opposite
>>> of what the user expects.
>> I tend to agree. We should start with a CSR review of this, though, as
>> it would be a behavioural change as compared to previous versions of
>> the JDK.
> 
> Hi Severin,
> 
> Sorry for the delay. I've created a CSR. Could you take a look?
> 
> https://bugs.openjdk.java.net/browse/JDK-8281571
> 
>>
>>> Also, the current implementation is really tied to specific behaviors of
>>> Kubernetes + docker (the 1024 and 100 constants). This will cause
>>> problems with other container/orchestration software that use different
>>> algorithms and constants.
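
To illustrate that fragility (a hypothetical sketch; the alternative constant
below is made up and does not describe any real framework): the shares-to-CPU
mapping is only meaningful if the per-CPU constant the JVM assumes matches the
convention the orchestrator actually used when it wrote cpu.shares.

    // Hypothetical sketch -- not JDK code.
    static int sharesToCpus(int shares, int assumedPerCpuShares) {
        return (int) Math.ceil((double) shares / assumedPerCpuShares);
    }
    // sharesToCpus(2048, 1024) == 2   (Docker/Kubernetes-style convention)
    // sharesToCpus(2048,  512) == 4   (an orchestrator using a different scale)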
>> There are other container orchestration frameworks, like Mesos, which
>> behave in a similar way (the 1024 constant is used there too). The good
>> news is that Mesos seems to have moved to a hard-limit default. See:
>>
>> https://mesos.apache.org/documentation/latest/quota/#deprecated-quota-guarantees 
>>
>>
>>>>    * How likely is CPU underutilization to happen in practice?
>>>>      Considering the container is not the only container on the node,
>>>>      then according to your formula, it'll get one CPU or less anyway.
>>>>      Underutilization would, thus, only happen when it's an idle node
>>>>      with no other containers running. That would suggest to do nothing
>>>>      and let the user override it as they see fit.
>>> I think underutilization happens when the containers have a bursty
>>> usage pattern. If other containers do not fully utilize their CPU
>>> quotas, we should distribute the unused CPUs to the busy containers.
>> Right, but this isn't really something the JVM process should care
>> about. It's really a core feature of the orchestration framework to do
>> that. All we could do is to not limit CPU for those cases. On the other
>> hand there is the risk of resource starvation too. Consider a node with
>> many cores, 50 say, and a very small cpu share setting via container
>> limits. The experience of running a JVM application in such a setup would
>> be very mediocre as the JVM thinks it can use 50 cores (100% of the
>> time), yet it would only get this when the rest of the
>> containers/universe is idle.
> 
> I think we have a general problem that's not specific to containers. If 
> we are running 50 active Java processes on a bare-metal Linux host, then 
> each of them would by default use a 50-thread ForkJoinPool. If each 
> process is given an equal amount of CPU resources, it would make sense 
> for each of them to have a single-thread FJP so we can avoid all the 
> thread context switching.

The JVM cannot optimise this situation because it has no knowledge of 
the system, its load, or the workload characteristics. It also doesn't 
know how the scheduler may apportion CPU resources. Sizing heuristics 
within the JDK itself are pretty basic. If the user/deployer has better 
knowledge of what would constitute an "optimum" configuration then they 
have control knobs (system properties, VM flags) they can use to 
implement that.
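
For concreteness, two such knobs that exist today (the program below is only a
hypothetical check; the flag values are examples, not recommendations):

    import java.util.concurrent.ForkJoinPool;

    // Prints the CPU count the JVM reports and the common pool's parallelism,
    // which by default is derived from that CPU count.
    public class SizingKnobs {
        public static void main(String[] args) {
            System.out.println("availableProcessors = "
                    + Runtime.getRuntime().availableProcessors());
            System.out.println("common FJP parallelism = "
                    + ForkJoinPool.commonPool().getParallelism());
            // Example overrides, set on the command line:
            //   -XX:ActiveProcessorCount=2
            //       changes what availableProcessors() reports
            //   -Djava.util.concurrent.ForkJoinPool.common.parallelism=2
            //       sizes the common pool directly
        }
    }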

> Or, maybe the Linux kernel is already good enough? If each process is 
> bound to a single physical CPU, context switching between the threads of 
> the same process should be pretty lightweight. It would be worthwhile 
> writing a test case ....

Binding a process to a single CPU would be potentially very bad for some 
workloads. Neither end-point is likely to be "best" in general.

Cheers,
David

> 
> Thanks
> - Ioi
> 
> 
>>
>> Thanks,
>> Severin
>>
> 

