8152312: ParNew: Restore preserved marks in parallel

Tony Printezis tprintezis at twitter.com
Fri Mar 25 13:35:31 UTC 2016


Thomas,

Inline.

On March 25, 2016 at 8:56:59 AM, Thomas Schatzl (thomas.schatzl at oracle.com) wrote:

Hi, 

On Thu, 2016-03-24 at 11:36 -0400, Tony Printezis wrote: 
> Thomas, 
> 
> Thanks for the feedback. Updated webrev here: 
> 
> http://cr.openjdk.java.net/~tonyp/8152312/webrev.1/ 
> 
> Changes: 
> 
> - Your main comment was whether to make the serial vs. parallel 
> decision in DefNew or in restore(). I liked the suggestion of either 
> passing NULL or the workers to restore(WorkGang*) and deciding in 
> restore(…) what to do. The only problem is that ParallelGC doesn’t 
> use a WorkGang, it uses a GCTaskManager instead. 

Drat! 


Indeed. :-)




> And anything we do in restore(...) would be nice to also take 
> advantage of in ParallelGC. So, I came up with a cunning scheme to do 
> that. Let me know what you think. :-) Yes, it works fine with 
> ParallelGC too (I already implemented/tested the changes, I’ll open 
> them for code review once this is done.) I left a restore() method 
> there temporarily to be used by ParallelGC but it will go away. 

Some specialized implementation of the method based on the type of E I 
guess. 


First, if it helps you, here are the changes for PS (as applied on top of the ParNew changes):

http://cr.openjdk.java.net/~tonyp/8152312/webrev.1.ps/




Note that we now basically duplicate the restore() method, 


Depends which restore() method you’re talking about. restore(E*) remains unchanged in the PS changes. We have to introduce a new restore_internal() that knows how to use the GCTaskManager (and, yes, there’s a bit of code replication in that task; but only a few lines). So, in the future, if we add some additional logic to decide whether to do the restoration serially or in parallel, we’ll only have to change restore(E*) and the “executor”-specific code remains the same (I have hacked up a version of that too, as a proof-of-concept, and it does work out fine, FWIW).



it gets 
close to just have two (parallel gc, the rest) implementations of the 
class. Let's see your solution. 


We have a single method that decides what to do (serial or parallel restoration) and yields to the specific parallel implementation as needed. 

I used a template so that I can put the shared implementation in one place and still be able to easily call the “executor”-specific methods when needed without having lots of ridiculous boilerplate code (pass both executors to the method, make sure only one of them is non-NULL, based on which one is not NULL call the appropriate parallel method, etc.).

If you don’t like the use of the template here, there’s an alternative. Expose the following:

restore(WorkGang* workers)

restore(GCTaskManager* gc_task_manager)

and each if those first calls a private restore_internal() which either does the restoration serially or decides not to and tells the “executor”-specific restore() that it needs to do the work in parallel. But you said in your e-mail that you wanted a single restore() which yields to the parallel when needed, which is what I tried to do in the webrev. :-)

Either way is totally fine with me. Pick your poison.




Since I do not know the actual implementation of your scheme, but it 
would be nice if the code between lines 65 and 68 in 
preservedMarks.inline.hpp were moved into a method too, particularly if 
the future implementation also has a single-threaded path.  


It doesn’t, that loop is not replicated (see the PS webrev).




I would prefer to have the implementation(s) of restore() in the cpp 
file(s) though


I agree. But, given that restore(E*) is a template, I assume I have to put it in the .inline.hpp file. If you really want to in the .cpp file, I think we have to go with the scheme I describe above (expose restore(WorkGang*) and restore(GCTaskManager*)).



 (and possibly hide the one for parallel somewhere very 
far away ;) ). 


:-)



The additional call in this case does not seem to be 
problematic vs. performance. 


I totally agree with this.




> - Yes, if each stack has a small number of entries we’re better off 
> doing the restoration serially instead of firing up the workers. This 
> will be a small localized change to what I have in the webrev (we’ll 
> only have to change restore(E*); 

One comment here: the comment at 

70 // Right now, if the executor is not NULL we do the work in 
71 // parallel. In the future we might want to do the restoration 
72 // serially, if there's only a small number of marks per 
stack. 

Imo is better placed before line 64, cut a lot to something like: 

// Decide on the amount of parallelism. 

Please add an RFE in the bug tracker explaining the situation about "in 
the future" ideas. That would be much better than dumping your thoughts 
in some random place in the code. 


I put the comment when we were calling the parallel version, so I didn’t think it was a random place. :-) But, sure, I’ll move it earlier...




> the main question will be what policy to use!). I'll do a follow-up 
> to this too. 

The main goal is not to be perfect imo, but to determine a reasonable 
somewhat conservative upper bound on the amount of threads used. Too 
many people use too large machines on small problems. 


I agree, I don’t think it needs to be perfect and we should spend too much time on it. I was actually going to propose a very simple scheme that makes the determination based on the size of the stacks. I.e., start iterating over the stacks and, for every stack we check its size. If it's under a limit we just go ahead and do the work serially. Eventually, if we come across a stack whose size is “too large” we bail out of the serial phase and yield of the parallel version (and we can notify the parallel version that we’ve already done N stacks and it should ignore them). Even if there’s a small number of GC threads (2 / 4), it’s still worth doing the work serially if only a small number of marks have been preserved. (IMHO)




One could argue, that work units < 1ms (or another low value) are not 
worth distributing.


I agree.



 From that one could determine a "reasonable" 
minimum number of references to process per thread, from which you 
could then deduct a range of PreservedMarks buffers that each thread 
could use (assuming that the actual references are evenly distributed 
across these sets of references). 

At least that's what I would start with. 

> - Ah, yes, I’m glad there’s an atomic add for size_t in JDK 9. :-) 
> 
> - Re: ParRestoreTask vs. RestoreTask : I tend to add “Par” as the 
> tasks should be safe to use in parallel (even if in practice they are 
> not). Hope that’s OK. 

Another reason is that nowadays practically everything done during a 
STW pause must somehow be implemented to run in parallel to be actually 
useful (except for serial gc obviously). So this marker is kind of 
misleading and redundant. 


Depends whether you see “Par” as “it will be done in parallel” or “it might be done in parallel”. I see it as the latter (and there’s lots of cases in the code where Par is used for such classes). 



Tony




Not really insisting on this, but it may be that some future general cleanup will modify this (nothing actually planned from me). 



Thanks, 
Thomas 












-----

Tony Printezis | JVM/GC Engineer / VM Team | Twitter

@TonyPrintezis
tprintezis at twitter.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20160325/6e47d4e3/attachment.htm>


More information about the hotspot-gc-dev mailing list