[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization

David Holmes david.holmes at oracle.com
Mon Feb 7 01:12:21 UTC 2022


Just for the record ...

On 3/02/2022 7:19 pm, David Holmes wrote:
> Hi Ioi,
> 
> For the benefit of the mailing list discussion ...
> 
> On 3/02/2022 5:30 pm, Ioi Lam wrote:
>> Please see the bug report [1] for detailed description and test cases.
>>
>> I'd like to have some discussion before we can decide what to do.
>>
>> I discovered this issue when analyzing JDK-8279484 [2]. Under 
>> Kubernetes (minikube), Runtime.availableProcessors() returns 1, 
>> despite the fact that the machine has 32 CPUs, the Kubernetes node has 
>> a single deployment, and no CPU limits were set.
>>
>> Specifically, I want to understand why the JDK is using 
>> CgroupSubsystem::cpu_shares() to limit the number of CPUs used by the 
>> Java process.
> 
> Because we were asked to by customers deploying in containers.
> 
>> In cgroup, there are other ways that are designed specifically for 
>> limiting the number of CPUs, i.e., CgroupSubsystem::cpu_quota(). Why 
>> is using cpu_quota() alone not enough? Why did we choose the current 
>> approach of considering both cpu_quota() and cpu_shares()?
> 
> Because people were using both (whether that made sense or not) and so 
> we needed a policy on what to do if both were set.
> 
>> My guess is that sometimes people don't limit the actual number of 
>> CPUs per container, but instead use CPU Shares to set the relative 
>> scheduling priority between containers.
>>
>> I.e., they run "docker run --cpu-shares=1234" without using the 
>> "--cpus" flag.
>>
>> If this is indeed the reason, I can understand the (good) intention, 
>> but the solution seems woefully insufficient.
>>
>> CPU Shares is a *relative* number. How much CPU is allocated to you 
>> depends on
>>
>> - how many other processes are actively running
>> - what their CPU Shares are
>>
>> The above information can change dynamically, as other processes may 
>> be added or removed, and they can change between active and idle states.
>>
>> However, the JVM treats CPU Shares as an *absolute/static* number, and 
>> sets the CPU quota of the current process using this very simplistic 
>> formula.
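>>
>> Roughly, as a sketch (not the actual HotSpot source; the class and
>> method names here are made up for illustration):
>>
>>     // How the JVM appears to turn cpu.shares into a CPU count.
>>     // 1024 is the cgroup default, which the JVM treats as "unset".
>>     class SharesToCpus {
>>         static int cpusFromShares(int shares) {
>>             if (shares == 1024) return -1; // no limit inferred from shares
>>             // divide by the per-CPU default of 1024 and round up
>>             return (int) Math.ceil(shares / 1024.0);
>>         }
>>         public static void main(String[] args) {
>>             for (int s : new int[] { 1023, 1024, 2048, 4096 })
>>                 System.out.println(s + " -> " + cpusFromShares(s));
>>         }
>>     }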
> 
>  From old discussion and the code I believe the thought was that share 
> was relative to the per-CPU default shares of 1024. So we use that 
> to determine the fraction of each CPU that should be assigned, and we 
> should then use that to determine the available number of CPUs. But that 
> isn't what we actually do - we only calculate the fraction and round it 
> up to get the number of CPUs and that is wrong (and typically only gives 
> 1 CPU because shares < 1024). I speculate that what was intended was to 
> map from having an X% share of each CPU, to instead having access to X% 
> of the total CPUs (at 100% of each). Mathematically this has some basis 
> but it actually makes no practical sense from a throughput or response 
> time perspective. If I'm allowed 50% of the CPU per time period to do my 
> calculations, I want 100% of each CPU for half of the period as that 
> potentially minimises the elapsed time till I have a result.
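> 
> To make that concrete, a rough sketch (my reading of the intent, not 
> the actual code), e.g. in jshell, with shares=512 on a 32-CPU machine:
> 
>     double fraction = 512 / 1024.0;                // 50% of each CPU
>     int current  = (int) Math.ceil(fraction);      // what we do today: 1
>     int intended = (int) Math.ceil(fraction * 32); // what I suspect was meant: 16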
> 
>> Value of /sys/fs/cgroup/cpu.shares -> cpu quota:
>>
>>      1023 -> 1 CPU
>>      1024 -> no limit (huh??)
>>      2048 -> 2 CPUs
>>      4096 -> 4 CPUs
>>
>> This seems just wrong to me. There's no way you can get a "correct" 
>> result without knowing what other processes are running at the same 
>> time.
> 
> As I said above and in the bug report I think this was an error and the 
> intent was to then multiply by the number of actual processors.

No, it was not an error. See the discussion Severin referenced:

http://mail.openjdk.java.net/pipermail/hotspot-dev/2019-January/036093.html

David
-----

>> The net effect is that when Java is running under a container, more 
>> likely than not the JVM will limit itself to a single CPU. This seems really 
>> inefficient to me.
> 
> Yes.
> 
>> What should we do?
> 
> We could just adjust the calculation as I suggested.
> 
> Or, given that share aka weight is meaningless without knowing the total 
> weight in the system, we could just ignore it. The app then gets access 
> to all CPUs, and it is up to the container to track actual usage and 
> impose any limits configured.
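> 
> A sketch of that second option (hypothetical code; it assumes we still 
> honour an explicit quota if one is configured):
> 
>     // Hypothetical policy: honour an explicit quota, ignore cpu.shares.
>     static int availableCpus(long quota, long period, int hostCpus) {
>         if (quota > 0 && period > 0)   // an explicit --cpus style limit
>             return (int) Math.min(hostCpus, Math.ceil((double) quota / period));
>         return hostCpus;               // shares deliberately ignored
>     }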
> 
> I've always thought that these cgroups mechanisms were fundamentally 
> flawed and that if the intent was to define a resource-limited 
> environment, then the environment should report what resources were 
> available by the normal APIs. They got this right with cpu-sets by 
> integrating with sched_getaffinity; but for shares and quotas it has 
> been left to the applications to try to figure out what that should 
> mean - and that makes no sense to me.
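> 
> For example (assuming a Linux host), pinning the process to two CPUs 
> is visible through the normal API with no container-specific logic:
> 
>     // Run as e.g. "taskset -c 0,1 java CpuCount.java"; this prints 2,
>     // because on Linux availableProcessors() respects the
>     // sched_getaffinity mask.
>     public class CpuCount {
>         public static void main(String[] args) {
>             System.out.println(Runtime.getRuntime().availableProcessors());
>         }
>     }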
> 
> Cheers,
> David
> 
>> Thanks
>> - Ioi
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8281181
>> [2] https://bugs.openjdk.java.net/browse/JDK-8279484

