Better default for ParallelGCThreads and ConcGCThreads by using number of physical cores and CPU mask.
Bernd Eckenfels
bernd-2014 at eckenfels.net
Wed Jan 15 20:25:01 UTC 2014
Hello,
I figured that there are no x64 chips with CMT > 2, but architectures
like POWER and SPARC do have threads/core > 2. Maybe this code can be
generalized, with the threading divisor being provided by
architecture-specific code (which in the case of x64 could be a
function that returns 2, or 1 if HT is off). For AIX/P7 it could be
SMT1/SMT2/SMT4. For T3 there are up to 8 hardware threads, but I am
not sure whether they need the same reduction or whether the
architecture can make better use of that capacity while heap walking.
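To make that concrete, a rough sketch of the hook I have in mind
(threads_per_core() and its wiring are invented for illustration, not
taken from the webrev):

  // Hypothetical per-architecture hook: how many hardware threads
  // share one core (2 on x64 with HT, 1 without; 1/2/4 on P7
  // depending on SMT mode; up to 8 on T3).
  static uintx threads_per_core();

  // Shared code could then derive the worker count from the
  // logical CPU count and the per-architecture divisor.
  uintx VM_Version::calc_parallel_worker_threads() {
    uintx logical = os::active_processor_count();
    return MAX2((uintx)1, logical / threads_per_core());
  }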
(Another option for this formula, BTW, is to ask the platform-specific
layer not for the number of logical CPUs but directly for the number
of cores in the current resource pool/node/set.)
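On Linux such a query could look roughly like this (only a sketch
with invented names, not the code from the patch: it counts distinct
(physical id, core id) pairs in /proc/cpuinfo, restricted to the CPUs
in the sched_getaffinity mask):

  #include <sched.h>   // sched_getaffinity, CPU_ISSET (glibc)
  #include <stdio.h>
  #include <set>
  #include <utility>

  // Number of physical cores this process may run on, or -1 if it
  // cannot be determined (callers would fall back to the logical
  // CPU count).
  static int available_core_count() {
    cpu_set_t mask;
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) return -1;
    FILE* f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) return -1;
    std::set<std::pair<int, int> > cores;
    int cpu = -1, phys = 0, core = 0;
    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
      if (sscanf(line, "processor : %d", &cpu) == 1) continue;
      if (sscanf(line, "physical id : %d", &phys) == 1) continue;
      if (sscanf(line, "core id : %d", &core) == 1 &&
          cpu >= 0 && CPU_ISSET(cpu, &mask)) {
        cores.insert(std::make_pair(phys, core));  // one per core
      }
    }
    fclose(f);
    return cores.empty() ? -1 : (int)cores.size();
  }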
(I wonder if it makes sense to use a divisor of 1 on some hypervisors,
even when the emulated VCPU announces HT?)
Greetings
Bernd
On 15.01.2014 19:52, Jon Masamitsu <jon.masamitsu at oracle.com> wrote:
>
> On 1/15/2014 4:51 AM, Bengt Rutisson wrote:
>>
>> On 2014-01-13 22:39, Jungwoo Ha wrote:
>>>
>>>
>>> In CMSCollector there is still this code to change the value of
>>> ConcGCThreads based on AdjustGCThreadsToCores.
>>>
>>>
>>> if (AdjustGCThreadsToCores) {
>>>   FLAG_SET_DEFAULT(ConcGCThreads, ParallelGCThreads / 2);
>>> } else {
>>>   FLAG_SET_DEFAULT(ConcGCThreads, (3 + ParallelGCThreads) / 4);
>>> }
>>>
>>> Do you think that is needed, or can we use the same logic in both
>>> cases, given that ParallelGCThreads has a different value if
>>> AdjustGCThreadsToCores is enabled?
>>>
>>>
>>> I am happy to just use
>>> FLAG_SET_DEFAULT(ConcGCThreads, ParallelGCThreads / 2);
>>> The original hotspot code used
>>> FLAG_SET_DEFAULT(ConcGCThreads, (3 + ParallelGCThreads) / 4);
>>> which I think is somewhat arbitrary.
>>> Now that ParallelGCThreads will be reduced on some configurations,
>>> dividing it by 4 seems to make ConcGCThreads too small.
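>>> For concreteness, with ParallelGCThreads = 16 the old default
>>> gives (3 + 16) / 4 = 4 ConcGCThreads, while ParallelGCThreads / 2
>>> gives 8. And once core-based sizing halves ParallelGCThreads to 8,
>>> the old formula would leave only (3 + 8) / 4 = 2.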
>>
>> Hm. Changing to FLAG_SET_DEFAULT(ConcGCThreads, ParallelGCThreads / 2)
>> might be the way to go, but I think that should probably be done as a
>> separate change. That way we can performance test it more thoroughly.
>>
>>>
>>>
>>> Also, I don't fully understand the name AdjustGCThreadsToCores.
>>> In VM_Version::calc_parallel_worker_threads() for x86 we simply
>>> divide active_core_count by 2 if this flag is enabled. So, the
>>> flag does not really adjust to the cores; it seems like it just
>>> reduces the number of GC threads. How about calling the flag
>>> ReduceGCThreads or something like that?
>>>
>>>
>>> The flag can be named better. However, ReduceGCThreads doesn't seem
>>> to reflect what this flag does.
>>> I am pretty bad at naming, so let me summarize what this flag is
>>> actually doing.
>>>
>>> The flag adjusts the number of GC threads to the number of
>>> "available" physical cores, as reported by the /proc filesystem
>>> and the CPU mask set by sched_setaffinity.
>>> For example, ParallelGCThreads will remain the same regardless of
>>> whether hyperthreading is turned on or off.
>>> The current hotspot code will have twice as many GC threads if
>>> hyperthreading is on.
>>> GC usually causes a huge number of cache misses, so having two GC
>>> threads competing for the same physical core hurts GC throughput.
>>> The current hotspot code doesn't consider the CPU mask at all.
>>> For example, even if the machine has 64 cores, when the CPU mask
>>> is set to 2 cores, current hotspot still calculates the number of
>>> GC threads based on 64.
>>> Thus, this flag effectively ties the number of GC threads to the
>>> number of physical cores available to the JVM process.
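>>> As an illustration: a machine with 16 physical cores and
>>> hyperthreading on reports 32 logical CPUs. The current code sizes
>>> the GC threads from 32, while with the flag the sizing starts from
>>> the 16 cores, so the result is the same whether hyperthreading is
>>> on or off.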
>>
>> Right. In VM_Version::calc_parallel_worker_threads() we take the value
>> of os::active_core_count() and divide it by 2. I guess this is to
>> reduce the cache issues. But if the flag is called
>> AdjustGCThreadsToCores I would have expected that we set the number of
>> GC threads to be equal to the core count. That's why I suggested
>> "Reduce" in the name.
>>
>> Naming is hard and I am not particularly fond of the name
>> ReduceGCThreads either. But maybe we can try to come up with something
>> else?
>
> How about ScaleGCThreadsByCores?
>
> Jon
>
>>
>>>
>>> I think I pointed this out earlier, but I don't feel comfortable
>>> reviewing the changes in os_linux_x86.cpp. I hope someone from
>>> the Runtime team can review that.
>>>
>>>
>>> Can you clarify what you meant? /proc and the CPU mask are
>>> dependent on Linux & x86, and I only tested on that platform.
>>> The assumptions I used here are based on the x86 cache architecture.
>>
>> What I was trying to say was that I don't know enough about Linux to
>> be confident that your implementation of os::active_core_count() is
>> the simplest and most stable way to retrieve that information. I'm
>> sure it is good, I am just not the right person to review this piece
>> of the code. That's why I think it would be good if someone from the
>> Runtime team looked at this.
>>
>> Thanks,
>> Bengt
>>
>>
>>>
>>> Jungwoo
>>>
>>
>
--
http://bernd.eckenfels.net