RFR: Parallelize safepoint cleanup

Mon May 29 14:16:16 UTC 2017

Am 29.05.2017 um 14:53 schrieb Robbin Ehn:
> Hi Roman,
>
> On 05/29/2017 10:23 AM, Roman Kennke wrote:
>> Hi David,
>>
>> the patch applied cleanly, i.e. was not affected by any hotspot changes.
>> Most of your suggestions applied to the GC part that I excluded from the
>> scope of this change, except for the following:
>>
>>>> I introduced a dedicated worker thread pool in SafepointSynchronize.
>>>> This is only initialized when -XX:+ParallelSafepointCleanup is
>>>> enabled,
>>>> and uses -XX:ParallelSafepointCleanupThreads=X threads, defaulting
>>>> to 8
>>>> threads (just a wild guess, open for discussion. With
>>>> -XX:-ParallelSafepointCleanup turned off (the default) it will use the
>>>> old serial safepoint cleanup processing (with optional GC hooks, see
>>>> below).
>>>
>>> I think the default needs to be ergonomically set, as for other
>>> "thread pools". Both flags should be experimental at this stage.
>>
>> I implemented this using the following:
>>
>>    if (ParallelSafepointCleanup &&
>> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
>>      // This will pick up ParallelGCThreads if set...
>>      FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
>> Abstract_VM_Version::parallel_worker_threads());
>>    }
>>
>> As noted, it will pick up ParallelGCThreads, which might be a reasonable
>> number. Otherwise we'd need to re-work the logic in vm_version.cpp to
>> calculate the number of worker threads without considering
>> ParallelGCThreads.
>>
>> http://cr.openjdk.java.net/~rkennke/8180932/webrev.02/
>> <http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.02/>
>>
>
> This worries me because heuristics will be very hard to the average user.
>
> 1885   if (ParallelSafepointCleanup &&
> FLAG_IS_DEFAULT(ParallelSafepointCleanupThreads)) {
> 1886     // This will pick up ParallelGCThreads if set...
> 1887     FLAG_SET_DEFAULT(ParallelSafepointCleanupThreads,
> Abstract_VM_Version::parallel_worker_threads());
> 1888   }
>
> So on a 8 vcpu machine we could be running 8 threads cleanup and 8
> workers threads might be marking?
>
> "and I now think this is the best solution"
>
> I don't agree that is the the best solution and I'm not in favor of a
> second thread pool.
> If you want short safepoints it seem much better to use available cpu
> time finishing the cleanup.
> Maybe steal or reserve worker threads... just keep 1 pool of thread
> which we can do decent sizing for.
>
> "For example, in one popular benchmark on a server-class machine, I got
> total average pause time down from ~80ms to ~30ms."
>
> What's the numbers for a 4 core/8 threads machine? And other GCs?
There are workloads and machine configs that are not thread-heavy, and
in no case have I seen regressions in pause time.

I agree that having a single pool would be good. The current WorkGang
doesn't do it though, because we can't borrow threads while the GC is
doing work, at least not in a way that is GC agnostic (see my reply to
Robbin).

> Looked through the patch quick, I think you should do some abstraction.
> Now this code is already lacking abstraction so I understand why you
> did it this way.
>
> For example:
> void ObjectSynchronizer::deflate_idle_monitors(bool
> deflate_thread_local_monitors)
> ...
>   if (MonitorInUseLists) {
>     if (deflate_thread_local_monitors) {
>
> You have mandatory parameter which is ignore by a global flag.
> Right now there are 4 different cases in the same code path.
> Does ParallelSafepointCleanup work with -MonitorInUseLists ?
> Please create some kind of abstraction for this and set it a startup.
Yeah. In the meantime, I reduced the scope of this patch (no longer
including the above), see current webrev:
http://cr.openjdk.java.net/~rkennke/8180932/webrev.02/
<http://cr.openjdk.java.net/%7Erkennke/8180932/webrev.02/>

I agree that better abstraction would be good.

> I do not see any reason for duplicating the
> SafepointSynchronize::serial_cleanup(), just create proper abstraction
> so it can be used in both cases?

Yeah, I will think about it. The easiest way out would be to treat the
serial case as 1-thread-MT case (called by VMThread), but this also
means we can't easily keep the existing code path as-is while having a
new separate code path. Maybe give the new code path some testing and
time to iron out bugs (if any) and later switch the old path to a
degenerated 1-threaded-MT path?

Roman