[RFC containers] 8281181 JDK's interpretation of CPU Shares causes underutilization
David Holmes
david.holmes at oracle.com
Tue Feb 15 05:50:10 UTC 2022
Trimming ...
On 15/02/2022 3:20 pm, Ioi Lam wrote:
> On 2/13/2022 11:02 PM, David Holmes wrote:
>> On 14/02/2022 4:07 pm, Ioi Lam wrote:
>>> I think we have a general problem that's not specific to containers.
>>> If we are running 50 active Java processes on a bare-bones Linux
>>> machine, then each of them would by default use a 50-thread
>>> ForkJoinPool. If each process is given an equal amount of CPU
>>> resources, it would make sense for each of them to have a
>>> single-thread FJP so we can avoid all the thread context switching.
>>
>> The JVM cannot optimise this situation because it has no knowledge of
>> the system, its load, or the workload characteristics. It also doesn't
>> know how the scheduler may apportion CPU resources. Sizing heuristics
>> within the JDK itself are pretty basic. If the user/deployer has
>> better knowledge of what would constitute an "optimum" configuration
>> then they have control knobs (system properties, VM flags) they can
>> use to implement that.
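[The knobs referred to above can be exercised directly. A minimal sketch
(mine, not from the thread) that reports what the JVM actually decided;
`-XX:ActiveProcessorCount=N` overrides the CPU count and
`-Djava.util.concurrent.ForkJoinPool.common.parallelism=N` overrides the
common pool size:]

```java
import java.util.concurrent.ForkJoinPool;

public class PoolSizing {
    public static void main(String[] args) {
        // Respects container limits and -XX:ActiveProcessorCount=N.
        int cpus = Runtime.getRuntime().availableProcessors();
        // Defaults to max(1, cpus - 1) unless overridden with
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=N.
        int fjp = ForkJoinPool.commonPool().getParallelism();
        System.out.println("cpus=" + cpus + " commonPoolParallelism=" + fjp);
    }
}
```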
>>
>>> Or, maybe the Linux kernel is already good enough? If each process is
>>> bound to a single physical CPU, context switching between the threads
>>> of the same process should be pretty lightweight. It would be
>>> worthwhile writing a test case ....
>>
>> Binding a process to a single CPU would be potentially very bad for
>> some workloads. Neither end-point is likely to be "best" in general.
>>
>
> I found some interesting numbers. I think this means we don't accomplish
> much by restricting the size of thread pools from an already relatively
> small number (the number of physical CPUs, three digits or fewer) to an
> even smaller number computed by CgroupSubsystem::active_processor_count().
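[For context, the cgroup-v1 shares heuristic under discussion in
JDK-8281181 roughly maps cpu.shares to a CPU count by dividing by 1024
(one "CPU's worth" of shares), clamped to the host count. The sketch
below is illustrative only - names and rounding details do not match the
HotSpot sources:]

```java
public class SharesHeuristic {
    // One CPU's worth of cpu.shares in the cgroup-v1 convention.
    static final int PER_CPU_SHARES = 1024;

    // Illustrative approximation of what active_processor_count()
    // derives from cpu.shares: at least 1, never more than the host has.
    static int processorCountFromShares(int shares, int hostCpus) {
        int fromShares = Math.max(1, shares / PER_CPU_SHARES);
        return Math.min(fromShares, hostCpus);
    }

    public static void main(String[] args) {
        System.out.println(processorCountFromShares(512, 32));     // 1
        System.out.println(processorCountFromShares(4096, 32));    // 4
        System.out.println(processorCountFromShares(100_000, 32)); // 32
    }
}
```

[So a container given a modest shares value reports very few processors
even when the host is otherwise idle, which is the under-utilization
being debated.]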
>
> https://eli.thegreenplace.net/2018/measuring-context-switching-and-memory-overheads-for-linux-threads/
>
>
> <quote>
> [Cost for each context switch is] somewhere between 1.2 and 1.5
> microseconds per context switch ... Is 1-2 us a long time? As I have
> mentioned in the post on launch overheads, a good comparison is memcpy,
> which takes 3 us for 64 KiB on the same machine. In other words, a
> context switch is a bit quicker than copying 64 KiB of memory from one
> location to another.
> ...
> Conclusion
> The numbers reported here paint an interesting picture on the state of
> Linux multi-threaded performance in 2018. I would say that the limits
> still exist - running a million threads is probably not going to make
> sense; however, the limits have definitely shifted since the past, and a
> lot of folklore from the early 2000s doesn't apply today. On a beefy
> multi-core machine with lots of RAM we can easily run 10,000 threads in
> a single process today, in production.
> </quote>
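[The per-switch cost Ioi quotes can be estimated with a simple two-thread
ping-pong; this is my sketch, not from the thread, and the numbers vary
widely by machine, scheduler, and CPU affinity. Each round trip forces at
least two hand-offs between the threads:]

```java
import java.util.concurrent.SynchronousQueue;

public class PingPong {
    // Average microseconds per round trip between two threads; a round
    // trip involves at least two thread hand-offs (and typically context
    // switches when both threads share a core).
    static double roundTripMicros(int iters) throws InterruptedException {
        SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        SynchronousQueue<Integer> pong = new SynchronousQueue<>();
        Thread t = new Thread(() -> {
            try {
                for (int i = 0; i < iters; i++) {
                    pong.put(ping.take()); // echo each message back
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            ping.put(i);
            pong.take();
        }
        long ns = System.nanoTime() - start;
        t.join();
        return ns / 1000.0 / iters;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.printf("~%.2f us per round trip%n", roundTripMicros(100_000));
    }
}
```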
I agree that the under-utilization caused by the way shares are currently
used is bad. But I don't see how the above really relates to that at
all. The above is primarily about the RAM cost of threads - and I agree
it's better now than it used to be, so a system can support many more
threads than it used to. But the main issue with sizing thread pools etc.
is about effective servicing of load to achieve either throughput or
response-time goals. Too many threads, just like having too many of any
kind of worker, can be very inefficient when they just get in each
other's way.
Cheers,
David
-----
> So after the proposed change, some users may be surprised, "why do I now
> have 32 threads sleeping inside my containerized app", but the actual
> CPU/memory cost would be minimal, with a large potential upside -- the
> app can run much faster when the rest of the system is quiet.
>
> (I ran a small test on Linux x64 and the cost per Java thread is about
> 90KB).
>
> Thanks
> - Ioi
>