Better default for ParallelGCThreads and ConcGCThreads by using number of physical cores and CPU mask.

Fri Nov 22 12:33:34 UTC 2013

Hi Jungwoo,

On 2013-11-21 18:37, Jungwoo Ha wrote:
>
>
> On Thu, Nov 21, 2013 at 2:11 AM, Bengt Rutisson 
> <bengt.rutisson at oracle.com <mailto:bengt.rutisson at oracle.com>> wrote:
>
>
>     Hi Jungwoo,
>
>     Thanks for contributing this change. Nice work.
>
>     First a couple general question.
>
>     Why did you only make this change for CMS? Shouldn't the other
>     collectors benefit from this as well? You said you tried the other
>     GCs as well. What were the results for them?
>
>
> The change will affect all collectors because it is setting 
> ParalellGCThreads and ConcGCThreads.
> I mostly tested on CMS for convenience of evaluation and needs.

Right. Sorry, I missed that ParallelGCThreads is initialized by 
Abstract_VM_Version::parallel_worker_threads() that you now override. 
So, you are right, this change will affect all collectors.

But doesn't that mean that you need to do the same changes for 
ConcGCThreads for G1 as you do for CMS?

>
>     Do we need the flag AdjustGCThreadsToCores? It seems like this is
>     a good change and it is always nice to reduce the number of flags
>     we have. If you feel unsure about the change it may be good to be
>     able to disable it, but in general I think it would be better to
>     decide that this is the default behavior and go with it.
>
>
> I have no strong opinion. If you guys can confirm the improvements and 
> feel like you don't need the flag, that would be ideal.

Yes, this is an interesting dilemma. Personally I don't think I have the 
cycles right now to do any performance testing of this fix. But I assume 
that you don't have access to enough machines and benchmarks to test 
this properly on all supported platforms. I'll bring this up internally 
to see if there is any process for handling this type of situation...

Since your change is for Linux only and you have done the measurements 
there, I'm fine with your results. I would prefer skipping the 
AdjustGCThreadsToCores flag based on the data you provided. Especially 
since we can still set ParallelGCThreads explicitly on the command line 
if we need to work around a regression somewhere.

>
>     os_linux_x86.cpp
>
>     This is where the important part of your change is. I don't know
>     enough about Linux internals to validate the changes you have made
>     there. I will leave that out of my review and let others (Runtime
>     team?) with more Linux knowledge review it.
>
>
>     Some code comments:
>
>     Since core_count() returns 0 for all platforms except linux_x86,
>     couldn't the following code be moved to a platform independent place?
>
>      758 int os::active_core_count() {
>      759   if (os::core_count() <= 0) {
>      760 os::set_core_count(os::active_processor_count());
>      761   }
>      762   return os::core_count();
>      763 }
>
>     That way this code would not have to be duplicated in all other
>     platforms:
>
>      726 int os::active_core_count() {
>      727   return os::active_processor_count();
>      728 }
>
>     We could just always use os::active_core_count() and rely on the
>     value it returns.
>
>
> Sounds good to me.

Good.

>
>
>     VM_Version::calc_parallel_worker_threads()
>
>      862     unsigned int ncpus = (unsigned int) os::active_core_count();
>      863     return (ncpus <= 8) ? ncpus : ncpus / 2;
>
>     Isn't it strange that if we have 8 cores we get 8 parallel
>     threads, but if we have 12 cores we only get 6? Wouldn't it make
>     more sense to treat this limit as a max value, so:
>
>     return (ncpus <= 8) ? ncpus : MAX2(8, ncpus / 2);
>
>
> I think that's better.

Good. :)

>
>
>     concurrentMarkSweepGeneration.cpp
>
>      639       if (AdjustGCThreadsToCores) {
>      640         FLAG_SET_DEFAULT(ConcGCThreads, ParallelGCThreads / 2);
>      641       } else {
>      642         FLAG_SET_DEFAULT(ConcGCThreads, (3 +
>     ParallelGCThreads) / 4);
>      643       }
>
>     This looks a little odd to me. Why do we add 3? That does not seem
>     to scale when we have a large number of parallel threads. Did you
>     mean to multiply by 3?
>
>
> (3 + ParallelGCThreads) / 4 is the upstream default, and I didn't 
> change it.
> ParallelGCThreads can be much smaller with the change, so I divided 
> into 2 instead of 4.

Sorry, I looked too quickly at the diff. You're right of course...

>
>     Also, do we really need to use a different calculation for the
>     ConcGCThreads? It would have been nice to be able to use the same
>     logic for both cases. Did you measure any performance difference
>     in the two cases?
>
>
> Maybe not. Most of the pause time is on the young generation in CMS. I 
> have seen pause time reduction on old generation as well, but anyway 
> that is small.
> As I mentioned earlier, ParallelGCThreads has changed when 
> Hyperthreading or CPU mask is turned on. Hence using the same 
> heuristics can cause too small ConcGCThreads.
> I'd like to use something like "os::active_core_counts() / 4" and make 
> it independent of using ParallelGCThreads.

Yes, I think it sounds like a good idea to decouple ParallelGCThreads 
and ConcGCThreads. Maybe we should do that as a separate change and just 
leave any chagnes to ConcGCThreads out of this patch?

Thanks,
Bengt

>
> Jungwoo

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20131122/7952e6c1/attachment.htm>