RFR: 8080106: Refactor setup of parallel GC threads

Stefan Karlsson stefan.karlsson at oracle.com
Tue May 19 12:30:47 UTC 2015


Hi Kim,

On 2015-05-18 23:21, Kim Barrett wrote:
> On May 18, 2015, at 7:17 AM, Stefan Karlsson <stefan.karlsson at oracle.com> wrote:
>> On 2015-05-14 05:01, Kim Barrett wrote:
>>> ==============================================================================
>>> http://cr.openjdk.java.net/~stefank/8080110/webrev.00
>>> 8080110: Remove usage of CollectedHeap::n_par_threads() from root processing
>>>
>>> I would have found this easier to review if the description contained
>>> some discussion of the elimination of MarkScope::_active.
>> Yes. I actually intended to split out the setup of StrongRootsScope and the removal of MarkScope::_active into a separate patch, but that patch was hard to motivate without the rest of 8080110.
>>
>> MarkScope::_active was used to allow the single-threaded code to call gen_process_roots without having to explicitly create a StrongRootsScope before the call. With this patch all callers of gen_process_roots have to pass a pointer to a StrongRootsScope as an argument, so there's no need to keep the MarkScope::_active variable anymore.
> Thanks, that confirms what I’d puzzled out.

Great.
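
For reference, after 8080110 the calling convention looks roughly like
this (just a sketch; the other arguments to gen_process_roots and the
exact StrongRootsScope constructor are elided):

    StrongRootsScope srs(n_workers);  // caller sets up the scope explicitly
    gch->gen_process_roots(&srs, /* ... level, closures, etc ... */);

Every caller now creates the StrongRootsScope itself and passes a pointer
down, so there's no implicit single-threaded fallback left for
MarkScope::_active to guard.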

>
>>> ==============================================================================
>>> http://cr.openjdk.java.net/~stefank/8080112/webrev.00
>>> 8080112: Replace and remove the last usages of CollectedHeap::n_par_threads()
>>>
>>> ------------------------------------------------------------------------------
>>> src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
>>> 1899   if (rem_sz < SmallForDictionary) {
>>> 1900     // The freeList lock is held, but multiple GC task threads might be executing in parallel.
>>> 1901     bool is_par = Thread::current()->is_GC_task_thread();
>>> 1902     if (is_par) _indexedFreeListParLocks[rem_sz]->lock();
>>> 1903     returnChunkToFreeList(ffc);
>>> 1904     split(size, rem_sz);
>>> 1905     if (is_par) _indexedFreeListParLocks[rem_sz]->unlock();
>>> 1906   } else {
>>> 1907     returnChunkToDictionary(ffc);
>>> 1908     split(size, rem_sz);
>>> 1909   }
>>>
>>> Conditional locking could be eliminated by changing the test on line
>>> 1899 to also test the is_par expression.  OTOH, see next comment.
>> One branch calls returnChunkToFreeList while the other calls returnChunkToDictionary, so I'm not sure I understand what you suggest.
> I’m not surprised you don’t understand, since I don’t either.  I think I missed that different returnChunkXXX functions were being called.

:)
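
For what it's worth, if we wanted to avoid the explicit lock()/unlock()
pair, the usual HotSpot idiom would be MutexLockerEx with a NULL mutex in
the non-parallel case. A sketch, not something I'm proposing for this
patch:

    if (rem_sz < SmallForDictionary) {
      // MutexLockerEx accepts a NULL mutex and then does no locking,
      // so the conditional lock collapses into the scoped locker.
      bool is_par = Thread::current()->is_GC_task_thread();
      MutexLockerEx ml(is_par ? _indexedFreeListParLocks[rem_sz] : NULL,
                       Mutex::_no_safepoint_check_flag);
      returnChunkToFreeList(ffc);
      split(size, rem_sz);
    } else {
      returnChunkToDictionary(ffc);
      split(size, rem_sz);
    }

The two branches still call different returnChunkXXX functions, though, so
the if/else itself stays.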

>
>>> ------------------------------------------------------------------------------
>>> src/share/vm/gc_implementation/concurrentMarkSweep/compactibleFreeListSpace.cpp
>>> 1900     bool is_par = (GenCollectedHeap::heap()->n_par_threads() > 0);
>>> =>
>>> 1901     bool is_par = Thread::current()->is_GC_task_thread();
>>>
>>> This new mechanism for computing is_par seems like it might be
>>> substantially more expensive than the old mechanism.  I'm not sure
>>> whether it's likely to be noticeably so.
>> Do you mean the call to Thread::current()? Do you have any suggestion on how to rewrite this?
> Yes, as that presumably involves TLS access.  Also, is_GC_task_thread is a virtual function, where n_par_threads is just a data member access.
>
> I don’t know if the difference is noticeable, in amongst all the other stuff going on.  But this looks to me like it might be a pretty common code path.  And no, I don’t have a better suggestion right now.  Knowing the appropriate value at some high level in the call chain and passing it down through all the relevant calls looks like it would be pretty ugly.
>
> Although it seems that for the n_par_threads-based version to work and produce the same result, there must be some global context switch between the is_par and !is_par state.  Is there some other bit in the heap that corresponds to that context?  E.g. do GC-task calls into here only occur during pauses and other calls only during non-pauses, or something like that?  But then there might be a single-threaded pause case?  Sorry, I don’t know the relevant code, so might be just making things up.

I added code to splitChunkAndReturnRemainder to count the number of
times we execute the new Thread::current()->is_GC_task_thread() call.
Of the benchmarks in gc-test-suite, GCBasher is the one that causes the
JVM to execute this path the most. On average the JVM enters this path
about 100 times per GC, which I think is a fairly low number compared
to everything else we're doing during a GC.
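
(The counting itself was a quick hack along these lines, a sketch with a
hypothetical counter name, not part of the patch:

    // Temporary instrumentation at the top of splitChunkAndReturnRemainder():
    static volatile intptr_t _split_path_count = 0;  // hypothetical name
    Atomic::inc_ptr(&_split_path_count);             // bump once per call
    bool is_par = Thread::current()->is_GC_task_thread();

with the counter dumped once per GC.)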

I then used a profiler to measure the time spent in 
splitChunkAndReturnRemainder:

    0.00%     VM Thread  libjvm.so  [.] _ZN24CompactibleFreeListSpace28splitChunkAndReturnRemainderEP9FreeChunkm
    0.00%  GC Thread#22  libjvm.so  [.] _ZN24CompactibleFreeListSpace28splitChunkAndReturnRemainderEP9FreeChunkm
    0.00%   GC Thread#6  libjvm.so  [.] _ZN24CompactibleFreeListSpace28splitChunkAndReturnRemainderEP9FreeChunkm
    0.00%  GC Thread#10  libjvm.so  [.] _ZN24CompactibleFreeListSpace28splitChunkAndReturnRemainderEP9FreeChunkm
    0.00%   GC Thread#5  libjvm.so  [.] _ZN24CompactibleFreeListSpace28splitChunkAndReturnRemainderEP9FreeChunkm
... etc ...

The sampled assembly shows that the Thread::current() call accounts for
a measurable share of that time:

        │       callq  BlockOffsetArrayNonContigSpace::split_block(HeapWord*, unsigned long, unsigned long)
   1.79 │       cmpq   $0x100,-0x38(%rbp)
   5.36 │       ja     110
        │       lea    ThreadLocalStorage::_thread_index,%rax
   1.79 │       mov    (%rax),%edi
   1.79 │       callq  pthread_getspecific@plt
        │       mov    (%rax),%rdx
  12.50 │       mov    %rax,%rdi
        │       callq  *0x50(%rdx)
        │       test   %al,%al

but since we spend very little time in this code, I don't think it's 
significant enough to optimize.

And as a reference, this is the "top" of the profiler output:

    4.31%  CMS Main Thread  libjvm.so           [.] _ZNK24CompactibleFreeListSpace4freeEv
    2.59%             java  libzip.so           [.] inflate_fast
    1.69%  CMS Main Thread  libjvm.so           [.] _ZN12SweepClosure14do_blk_carefulEP8HeapWord
    1.31%      GC Thread#2  libjvm.so           [.] _ZN16ParNewGeneration22copy_to_survivor_spaceEP18ParScanThreadStateP7oopDescmP11markOopDesc
    1.31%     GC Thread#19  libjvm.so           [.] _ZN16ParNewGeneration22copy_to_survivor_spaceEP18ParScanThreadStateP7oopDescmP11markOopDesc
    1.30%     GC Thread#14  libjvm.so           [.] _ZN16ParNewGeneration22copy_to_survivor_spaceEP18ParScanThreadStateP7oopDescmP11markOopDesc
    1.29%     GC Thread#13  libjvm.so           [.] _ZN16ParNewGeneration22copy_to_survivor_spaceEP18ParScanThreadStateP7oopDescmP11markOopDesc

... etc ...

And to make sure that splitChunkAndReturnRemainder wasn't inlined, I 
moved it to a separate .cpp file to "hide" it from all call sites.

Do you want me to do any further experiments with this?

>
>>> ------------------------------------------------------------------------------
>>> src/share/vm/gc_implementation/parNew/parCardTableModRefBS.cpp
>>>    44   assert(n_threads <= (uint)ParallelGCThreads, "# worker threads != # requested!");
>>>
>>> Pre-existing defect: The error message says "!=" but the test is "<=".
>> Changed to:
>>   assert(n_threads <= (uint)ParallelGCThreads, "# worker threads < # requested!");
> Shouldn’t that be “>” in the error message?

I think it depends on how you interpret "# requested". I'll change the
assert message to err_msg("n_threads: %u > ParallelGCThreads: %u", ...),
or something like that.
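
Spelled out, something like this (a sketch; the exact format specifiers
may end up different in the final patch):

    assert(n_threads <= (uint)ParallelGCThreads,
           err_msg("n_threads: %u > ParallelGCThreads: %u",
                   n_threads, (uint)ParallelGCThreads));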

Thanks,
StefanK

