[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization
Ioi Lam
ioi.lam at oracle.com
Tue Feb 15 05:20:54 UTC 2022
On 2/13/2022 11:02 PM, David Holmes wrote:
> On 14/02/2022 4:07 pm, Ioi Lam wrote:
>> On 2/8/2022 3:32 AM, Severin Gehwolf wrote:
>>> On Mon, 2022-02-07 at 22:29 -0800, Ioi Lam wrote:
>>>> On 2022/02/07 10:36, Severin Gehwolf wrote:
>>>>> On Sun, 2022-02-06 at 20:16 -0800, Ioi Lam wrote:
>>>>>> Case (4) is the cause of the bug in JDK-8279484
>>>>>>
>>>>>> Kubernetes leaves cpu.cfs_quota_us at -1 (no limit) and sets
>>>>>> cpu.shares to 2.
>>>>>> This means:
>>>>>>
>>>>>> - This container is guaranteed a minimum amount of CPU resources
>>>>>> - If no other containers are executing, this container can use as
>>>>>> much CPU as is available on the host
>>>>>> - If other containers are executing, the fraction of CPU guaranteed
>>>>>> to this container is 2 / (sum of cpu.shares of all active
>>>>>> containers); e.g., next to two busy containers with the default
>>>>>> 1024 shares each, about 2/2050, or roughly 0.1%
>>>>>>
>>>>>>
>>>>>> The fundamental problem with the current JVM implementation is that
>>>>>> it treats "CPU request" as a maximum value, the opposite of what
>>>>>> Kubernetes does. Because of this, in case (4), the JVM artificially
>>>>>> limits itself to a single CPU. This leads to CPU underutilization.
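In Java pseudo-form, the current interpretation boils down to something
like this (a simplified sketch only -- the real logic is C++ inside
HotSpot's cgroup support, and 1024-shares-per-CPU is the
Docker/Kubernetes convention):

    class SharesSketch {
        // Map a cpu.shares value to a processor count, 1024 shares == 1 CPU.
        // With Kubernetes setting cpu.shares=2, ceil(2/1024.0) == 1, so the
        // JVM sizes itself for a single CPU even on a large, idle host.
        static int sharesToProcessorCount(long shares, int hostCpus) {
            if (shares <= 0) {
                return hostCpus;                      // shares not configured
            }
            int count = (int) Math.ceil(shares / 1024.0);
            return Math.min(Math.max(count, 1), hostCpus);
        }
    }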
>>>>> I agree with your analysis. Key point is that in such a setup
>>>>> Kubernetes sets CPU shares value to 2. Though, it's a very specific
>>>>> case.
>>>>>
>>>>> In contrast to Kubernetes, the JVM doesn't have insight into what
>>>>> other containers are doing (or how they are configured). It would, perhaps,
>>>>> be good to know what Kubernetes does for containers when the
>>>>> environment (i.e. other containers) changes. Do they get restarted?
>>>>> Restarted with different values for cpu shares?
>>>> My understanding is that Kubernetes will try to do load balancing and
>>>> may migrate the containers. According to this:
>>>>
>>>> https://stackoverflow.com/questions/64891872/kubernetes-dynamic-configurationn-of-cpu-resource-limit
>>>>
>>>>
>>>> If you change the CPU limits, a currently running container will be
>>>> shut down and restarted (using the new limit), and may be relocated
>>>> to a different host if necessary.
>>>>
>>>> I think this means that a JVM process doesn't need to worry about the
>>>> CPU limit changing during its lifetime :-)
>>>>> Either way, what are our options to fix this? Does it need fixing?
>>>>>
>>>>> * Should we stop taking cpu shares into account as a means to limit
>>>>> CPU? It would be a significant change to how previous JDKs
>>>>> worked. Maybe that wouldn't be such a bad idea :)
>>>> I think we should get rid of it. This feature was designed to work
>>>> with Kubernetes, but has no effect in most cases. The only time it
>>>> takes effect (when no resource limits are set), it does the opposite
>>>> of what the user expects.
>>> I tend to agree. We should start with a CSR review of this, though, as
>>> it would be a behavioural change as compared to previous versions of
>>> the JDK.
>>
>> Hi Severin,
>>
>> Sorry for the delay. I've created a CSR. Could you take a look?
>>
>> https://bugs.openjdk.java.net/browse/JDK-8281571
>>
>>>
>>>> Also, the current implementation is really tied to specific behaviors
>>>> of Kubernetes + docker (the 1024 and 100 constants). This will cause
>>>> problems with other container/orchestration software that uses
>>>> different algorithms and constants.
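The quota side of the arithmetic is similar (again just a sketch of
logic that actually lives in HotSpot C++ code; the "100" presumably
comes from the default CFS period of 100 ms, i.e. cpu.cfs_period_us =
100000, so a limit of 2 CPUs becomes quota = 200000):

    class QuotaSketch {
        // quota/period, rounded up to a whole number of CPUs.
        static int quotaToProcessorCount(long quotaUs, long periodUs,
                                         int hostCpus) {
            if (quotaUs <= 0) {
                return hostCpus;                      // -1 means "no limit"
            }
            int count = (int) Math.ceil((double) quotaUs / periodUs);
            return Math.min(count, hostCpus);
        }
    }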
>>> There are other container orchestration frameworks, like Mesos, which
>>> behave in a similar way (1024 constant is being used). The good news is
>>> that mesos seems to have moved to a hard-limit default. See:
>>>
>>> https://mesos.apache.org/documentation/latest/quota/#deprecated-quota-guarantees
>>>
>>>
>>>>> * How likely is CPU underutilization to happen in practice?
>>>>> Considering the container is not the only container on the node,
>>>>> then according to your formula, it'll get one CPU or less anyway.
>>>>> Underutilization would, thus, only happen when it's an idle node
>>>>> with no other containers running. That would suggest doing nothing
>>>>> and letting the user override it as they see fit.
>>>> I think underutilization happens when the containers have a bursty
>>>> usage pattern. If other containers do not fully utilize their CPU
>>>> quotas, we should distribute the unused CPUs to the busy containers.
>>> Right, but this isn't really something the JVM process should care
>>> about. It's really a core feature of the orchestration framework to do
>>> that. All we could do is to not limit CPU for those cases. On the other
>>> hand, there is the risk of resource starvation too. Consider a node with
>>> many cores, say 50, and a very small cpu share setting via container
>>> limits. The experience running a JVM application in such a setup would
>>> be very mediocre as the JVM thinks it can use 50 cores (100% of the
>>> time), yet it would only get this when the rest of the
>>> containers/universe is idle.
>>
>> I think we have a general problem that's not specific to containers.
>> If we are running 50 active Java processes on a bare-metal Linux
>> machine, then each of them would by default use a 50-thread
>> ForkJoinPool. If each process is given an equal amount of CPU
>> resources, it would make sense for each of them to have a
>> single-threaded FJP so we can avoid all the thread context switching.
>
> The JVM cannot optimise this situation because it has no knowledge of
> the system, its load, or the workload characteristics. It also doesn't
> know how the scheduler may apportion CPU resources. Sizing heuristics
> within the JDK itself are pretty basic. If the user/deployer has
> better knowledge of what would constitute an "optimum" configuration
> then they have control knobs (system properties, VM flags) they can
> use to implement that.
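To make that concrete: both kinds of knobs exist today. A tiny demo
(the class name is made up; the flag and the property are real):

    import java.util.concurrent.ForkJoinPool;

    public class PoolSizes {
        public static void main(String[] args) {
            System.out.println("availableProcessors    = "
                    + Runtime.getRuntime().availableProcessors());
            System.out.println("commonPool parallelism = "
                    + ForkJoinPool.commonPool().getParallelism());
        }
    }

    // Override the container-derived CPU count for the whole JVM:
    //   java -XX:ActiveProcessorCount=2 PoolSizes
    // Or size just the common pool:
    //   java -Djava.util.concurrent.ForkJoinPool.common.parallelism=2 PoolSizes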
>
>> Or, maybe the Linux kernel is already good enough? If each process is
>> bound to a single physical CPU, context switching between the threads
>> of the same process should be pretty lightweight. It would be
>> worthwhile writing a test case ....
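Something along those lines could look like this (an untested sketch:
two threads hand a token back and forth through a SynchronousQueue, so
every round trip forces two context switches; run it under
"taskset -c 0" to bind both threads to one CPU and compare with the
unpinned case):

    import java.util.concurrent.SynchronousQueue;

    public class PingPong {
        public static void main(String[] args) throws Exception {
            final int ROUNDS = 1_000_000;
            SynchronousQueue<Integer> ping = new SynchronousQueue<>();
            SynchronousQueue<Integer> pong = new SynchronousQueue<>();
            Thread peer = new Thread(() -> {
                try {
                    for (int i = 0; i < ROUNDS; i++) {
                        pong.put(ping.take());    // bounce the token back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            peer.start();
            long start = System.nanoTime();
            for (int i = 0; i < ROUNDS; i++) {
                ping.put(i);
                pong.take();
            }
            long elapsed = System.nanoTime() - start;
            System.out.printf("%.0f ns per round trip%n",
                    (double) elapsed / ROUNDS);
            peer.join();
        }
    }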
>
> Binding a process to a single CPU would be potentially very bad for
> some workloads. Neither end-point is likely to be "best" in general.
>
I found some interesting numbers. I think they mean we don't accomplish
much by restricting the size of thread pools from a relatively small
number (the number of physical CPUs, three digits or fewer) to an even
smaller number computed by CgroupSubsystem::active_processor_count().
https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
<quote>
[Cost for each context switch is] somewhere between 1.2 and 1.5
microseconds per context switch ... Is 1-2 us a long time? As I have
mentioned in the post on launch overheads, a good comparison is memcpy,
which takes 3 us for 64 KiB on the same machine. In other words, a
context switch is a bit quicker than copying 64 KiB of memory from one
location to another.
...
Conclusion
The numbers reported here paint an interesting picture on the state of
Linux multi-threaded performance in 2018. I would say that the limits
still exist - running a million threads is probably not going to make
sense; however, the limits have definitely shifted since the past, and a
lot of folklore from the early 2000s doesn't apply today. On a beefy
multi-core machine with lots of RAM we can easily run 10,000 threads in
a single process today, in production.
</quote>
So after the proposed change, some users may be surprised ("why do I now
have 32 threads sleeping inside my containerized app?"), but the actual
CPU/memory cost would be minimal, with a large potential upside -- the
app can run much faster when the rest of the system is quiet.
(I ran a small test on Linux x64 and the cost per Java thread is about
90KB).
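That test was roughly like this (a sketch of that kind of measurement,
not necessarily the exact program; Linux-only, and note that RSS mostly
counts the touched portion of each thread stack, not the full reserved
virtual size):

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.concurrent.locks.LockSupport;

    public class ThreadCost {
        // Resident set size of this process, in KB (Linux-specific).
        static long rssKb() throws Exception {
            for (String line : Files.readAllLines(
                    Paths.get("/proc/self/status"))) {
                if (line.startsWith("VmRSS:")) {
                    return Long.parseLong(line.replaceAll("\\D+", ""));
                }
            }
            return -1;
        }

        public static void main(String[] args) throws Exception {
            final int N = 1000;
            long before = rssKb();
            for (int i = 0; i < N; i++) {
                Thread t = new Thread(() -> LockSupport.park());
                t.setDaemon(true);
                t.start();
            }
            Thread.sleep(1000);                   // let the threads settle
            long after = rssKb();
            System.out.printf("~%d KB per idle thread%n",
                    (after - before) / N);
        }
    }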
Thanks
- Ioi