[PATCH] Exploit Empty Regions in Young Gen to Enhance PS Full GC Performance

Wed Oct 23 15:15:52 UTC 2019

Hi Stefan,

Thanks for your constructive feedback. I've addressed all the issues you
mentioned, and the updated patch is attached in this email.

During refining the patch, I have a couple of questions:
1) Now the MoveAndUpdateClosure and ShadowClosure assume the destination
address is the very beginning of a region, instead of an arbitrary address
like what it used to be. However, there is an unused function named
PSParallelCompact::move_and_update() uses the MoveAndUpdateClosure to
process a region from its middle, which conflicts with the assumption. I
notice that you removed this function in your patch, and so did I in the
updated patch. Does it matter?
2) Using the same do_addr() in MoveAndUpdateClosure and ShadowClosure is
doable, but it does not reuse all the code neatly. Because storing the
address of the shadow region in _destination requires extra virtual
functions to handle allocating blocks in the start_array and setting
addresses of deferred objects. In particular, allocate_blocks() and
set_deferred_object_for() in both closures are added. Is it worth avoiding
to use _offset to calculate the shadow_destination?

If there are any problems with this patch, please contact me anytime. I'm
more than happy to keep improving the code. Thanks again for reviewing.

Best,
Haoyu Li

Stefan Johansson <stefan.johansson at oracle.com> 于2019年10月22日周二 下午9:42写道：

> Hi Haoyu,
>
> I've reviewed the patch now and have some comments and questions.
>
> To simplify the review and have a common base to look at I've created a
> webrev at:
> http://cr.openjdk.java.net/~sjohanss/8220465/00/
>
> One general note first, most of the new code uses four space
> indentation, in hotspot the standard is two spaces, please change this.
> Below are some file by file comments.
>
> src/hotspot/share/gc/parallel/psCompactionManager.cpp
> ---
>    53 GrowableArray<size_t >* ParCompactionManager::_free_shadow = new
> (ResourceObj::C_HEAP, mtInternal) GrowableArray<size_t >(10, true);
>    54 Monitor*                ParCompactionManager::_monitor = NULL;
>
> Set _free_shadow to NULL here like the other statics and then create the
> GrowableArray in initialize(). I also think _shadow_region_array or
> something like that would be a better name and the monitor should also
> be named something that signals that it is used for this array.
> ---
>    70   if (_monitor == NULL) {
>    71       _monitor = new Monitor(Mutex::barrier, "CompactionManager
> monitor",
>    72                              Mutex::_allow_vm_block_flag,
> Monitor::_safepoint_check_never);
>    73   }
>
> Instead of doing the monitor creation here having to check for NULL, do
> it in initialize() below together with the array creation.
> ---
>
> src/hotspot/share/gc/parallel/psParallelCompact.cpp
> ---
> 2974       if (cur->push()) {
>
> Correct me if I'm wrong, if this call to push() returns true it means
> that nobody else has "stolen" it (used a shadow region to prepare it)
> and we mark it as pushed. But when pushed in this code path this is the
> end state for this RegionData? If this is the case I think it would be
> easier to understand the code if we added another function and state for
> when we "steal" it. Haven't thought very much about the names but I
> think you understand what I want to achieve:
> Normal path:
> UNUSED -> push() -> NORMAL
> Steal path:
> UNUSED -> steal() -> STOLEN -> fill() -> FILLED -> copy() -> SHADOW
>
> We could then also assert in set_completed() that the state is either
> NORMAL or SHADOW (or if they have a shared end state DONE). As I said
> the names can be improved (both for the states and the functions) but I
> think we should have names and not just numbers.
> ---
>
> 3060 template <class T>
> 3061 void PSParallelCompact::fill_region(ParCompactionManager* cm,
> size_t region_idx, size_t shadow, size_t offset)
>
> As I told you this was a big improvement from the first patch, but I
> think there is room for even more improvements around the way we pass in
> ignored parameters to MoveAndUpdateClosure. Explaining my idea in text
> is harder than code, so I created a patch, what do you think about this?
> http://cr.openjdk.java.net/~sjohanss/8220465/00-alt/
>
> This alternative is based on 00 and does not take my other comments into
> consideration. So it might have to be altered a bit if you address some
> of my other comments/questions.
> ---
>
> 3196 void PSParallelCompact::copy_back(HeapWord *region_addr, HeapWord
> *shadow_addr) {
>
> I think the paramenter should change place, so that it corresponds with
> the copy below.
> ---
>
> 3200 bool PSParallelCompact::steal_shadow_region(ParCompactionManager*
> cm, size_t &region_idx) {
> 3201     size_t& record = cm->shadow_record();
>
> Did you consider to just let shadow_record() be a simple getter instead
> of getting a reference and then have a next_shadow_record() which
> advances it by active_workers?
> ---
>
> 3236 void PSParallelCompact::initialize_steal_record(uint which) {
>
> I'm having a hard time understanding the details here, or I get that all
> threads should have a separate shadow record, but I'm not sure why it is
> not enough to just do:
> size_t record = _summary_data.addr_to_region_idx(
>    _space_info[old_space_id].dense_prefix());
> cm->set_shadow_record(record + which);
>
> As you can see I'm also suggesting adding a setter for shadow_record.
> ---
>
> 3434 ParMarkBitMapClosure::IterationStatus
> 3435 ShadowClosure::do_addr(HeapWord* addr, size_t words) {
> 3436     HeapWord* shadow_destination = destination() + _offset;
>
> Using an offset instead of a given address feels a bit backwards, did
> you consider letting the closure keep and update a _shadow_destination
> instead? Or would it even be possible to just set destination to be the
> shadow region address? In that case it should be possible to just use
> the do_addr and other functions from the MoveAndUpdateClosure.
>
> I see from looking at this particular function that there is one assert
> that would have to change:
> 3408
> assert(PSParallelCompact::summary_data().calc_new_pointer(source(),
> compaction_manager()) ==
> 3409          destination(), "wrong destination");
>
> This should be easily fixed by adding a virtual function
> check_destination, that has a special implementation for the ShadowClosure.
> ---
>
> src/hotspot/share/gc/parallel/psParallelCompact.hpp
> ---
>   333     // Preempt the region to avoid double processes
>   334     inline bool push();
>   335     // Mark the region as filled and ready to be copied back
>   336     inline bool fill();
>   337     // Preempt the region to copy the shadow region content back
>   338     inline bool copy();
>
> As mentioned, I think there might be better names for those functions
> and the comments. Maybe adding a prefix would make the code more self
> explaining. try_push(), mark_filled(), try_copy() and the new try_steal().
> ---
>
> Thanks again for providing this patch, I look forward to see an updated
> version.
>
> Cheers,
> Stefan
>
>
> On 2019-10-14 15:00, Stefan Johansson wrote:
> > Thanks for the quick update Haoyu,
> >
> > This is a great improvement and I will try to find time to look into the
> > patch in more detail the coming weeks.
> >
> > Thanks,
> > Stefan
> >
> > On 2019-10-11 14:49, Haoyu Li wrote:
> >> Hi Stefan,
> >>
> >> Thanks for your suggestion! It is very redundant that
> >> PSParallelCompact::fill_shadow_region() copies most code from
> >> PSParallelCompact::fill_region(), and therefore I've refactored these
> >> two functions to share code as many as possible. And the attachment is
> >> the updated patch.
> >>
> >> Specifically, the closure, which moves objects, in
> >> PSParallelCompact::fill_region() is now declared as a template of
> >> either MoveAndUpdateClosure or ShadowClosure. So by controlling the
> >> type of closure when invoking the function, we can decide whether to
> >> fill a normal region or a shadow one. Thus, almost all code in
> >> PSParallelCompact::fill_region() can be reused.
> >>
> >> Besides, a virtual function named complete_region() is added in both
> >> closures to do some work after the filling, such setting states and
> >> copying the shadow region back.
> >>
> >> Thanks again for reviewing the patch, looking forward to your insights
> >> and suggestions!
> >>
> >> Best Regards,
> >> Haoyu Li
> >>
> >> 2019-10-10 21:50 GMT+08:00, Stefan Johansson
> >> <stefan.johansson at oracle.com>:
> >>> Thanks for the clarification =)
> >>>
> >>> Moving on to the next part, the code in the patch. So this won't be a
> >>> full review of the patch but just an initial comment that I would like
> >>> to be addressed first.
> >>>
> >>> The new function PSParallelCompact::fill_shadow_region() is more or
> less
> >>> a copy of PSParallelCompact::fill_region() and I understand that from a
> >>> proof of concept point of view it was the easy (and right) way to do
> it.
> >>> I would prefer if the code could be refactored so that fill_region()
> and
> >>> fill_shadow_region() share more code. There might be reasons that I've
> >>> missed, that prevents it, but we should at least explore how much code
> >>> can be shared.
> >>>
> >>> Thanks,
> >>> Stefan
> >>>
> >>> On 2019-10-10 15:10, Haoyu Li wrote:
> >>>> Hi Stefan,
> >>>>
> >>>> Thanks for your quick response! As to your concern about the OCA, I am
> >>>> the sole author of the patch. And it is the case as what the agreement
> >>>> states.
> >>>> Best Regrads,
> >>>> Haoyu Li,
> >>>>
> >>>>
> >>>> Stefan Johansson <stefan.johansson at oracle.com
> >>>> <mailto:stefan.johansson at oracle.com>> 于2019年10月10日周四 下午8:37
> >>>> 写道：
> >>>>
> >>>>      Hi,
> >>>>
> >>>>      On 2019-10-10 13:06, Haoyu Li wrote:
> >>>>       > Hi Stefan,
> >>>>       >
> >>>>       > Thanks for your testing! One possible reason for the
> >>>> regressions
> >>>> in
> >>>>       > simple tests is that the region dependencies maybe not heavy
> >>>> enough.
> >>>>       > Because the locality of shadow regions is lower than that of
> >>>> heap
> >>>>       > regions, writing to shadow regions will be slower than to
> >>>> normal
> >>>>       > regions, and this is a part of the reason why I reuse shadow
> >>>>      regions.
> >>>>       > Therefore, if only a few shadow regions are created and not
> >>>>      reused, the
> >>>>       > overhead may not be amortized.
> >>>>
> >>>>      I guess it is something like this. I thought that for "easy"
> heaps
> >>>> the
> >>>>      shadow regions won't be used at all, and should therefor not
> >>>> really
> >>>>      cost
> >>>>      anything.
> >>>>
> >>>>       >
> >>>>       > As to the OCA, it is the case that I'm the only person
> >>>> signing the
> >>>>       > agreement. Please let me know if you have any further
> >>>> questions.
> >>>>      Thanks
> >>>>       > again!
> >>>>
> >>>>      Ok, so you are the sole author of the patch. The important
> >>>> part, as
> >>>> the
> >>>>      agreement states, is:
> >>>>      "no other person or entity, including my employer, has or will
> >>>> have
> >>>>      rights with respect my contributions"
> >>>>
> >>>>      Is that the case?
> >>>>
> >>>>      Thanks,
> >>>>      Stefan
> >>>>
> >>>>       >
> >>>>       > Best Regrads,
> >>>>       > Haoyu Li
> >>>>       >
> >>>>       > Stefan Johansson <stefan.johansson at oracle.com
> >>>>      <mailto:stefan.johansson at oracle.com>
> >>>>       > <mailto:stefan.johansson at oracle.com
> >>>>      <mailto:stefan.johansson at oracle.com>>> 于2019年10月8日周二 下午
> >>>> 6:49
> >>>>      写道：
> >>>>       >
> >>>>       >     Hi Haoyu,
> >>>>       >
> >>>>       >     I've done some more testing and I haven't seen any issues
> >>>>      with the
> >>>>       >     patch
> >>>>       >     so far and the performance looks promising in most
> >>>> cases. For
> >>>>      simple
> >>>>       >     tests I've seen some regressions, but I'm not really sure
> >>>>      why. Will do
> >>>>       >     some more digging.
> >>>>       >
> >>>>       >     To move forward with this the first thing we need to do is
> >>>>      making sure
> >>>>       >     that you being covered by the Oracle Contributor
> >>>> Agreement is
> >>>>      enough.
> >>>>       >       From what we can see it is only you as an individual
> that
> >>>>      has signed
> >>>>       >     the OCA and in that case it is important that this
> >>>> statement
> >>>>      from the
> >>>>       >     OCA is fulfilled: "no other person or entity, including my
> >>>>      employer,
> >>>>       >     has
> >>>>       >     or will have rights with respect my contributions"
> >>>>       >
> >>>>       >     Is this the case for this contribution or should we have
> >>>> the
> >>>>      university
> >>>>       >     sign the OCA as well? For more information regarding the
> >>>> OCA
> >>>>      please
> >>>>       >     refer to:
> >>>>       > https://www.oracle.com/technetwork/oca-faq-405384.pdf
> >>>>       >
> >>>>       >     Thanks,
> >>>>       >     Stefan
> >>>>       >
> >>>>       >     On 2019-09-16 16:02, Haoyu Li wrote:
> >>>>       >      > FYI, the evaluation results on OpenJDK 14 are plotted
> in
> >>>> the
> >>>>       >     attachment.
> >>>>       >      > I compute the full GC throughput by dividing the heap
> >>>> size
> >>>>      before
> >>>>       >     full
> >>>>       >      > GC by the GC pause time, and the results are arithmetic
> >>>> mean
> >>>>       >     values of
> >>>>       >      > ten runs after a warm-up run. The evaluation is
> >>>> conducted on
> >>>> a
> >>>>       >     machine
> >>>>       >      > with dual Intel ®XeonTM E5-2618L v3 CPUs (2 sockets, 16
> >>>>      physical
> >>>>       >     cores
> >>>>       >      > with SMT enabled) and 64G DRAM.
> >>>>       >      >
> >>>>       >      > Best Regrads,
> >>>>       >      > Haoyu Li,
> >>>>       >      > Institute of Parallel and Distributed Systems(IPADS),
> >>>>       >      > School of Software,
> >>>>       >      > Shanghai Jiao Tong University
> >>>>       >      >
> >>>>       >      >
> >>>>       >      > Stefan Johansson <stefan.johansson at oracle.com
> >>>>      <mailto:stefan.johansson at oracle.com>
> >>>>       >     <mailto:stefan.johansson at oracle.com
> >>>>      <mailto:stefan.johansson at oracle.com>>
> >>>>       >      > <mailto:stefan.johansson at oracle.com
> >>>>      <mailto:stefan.johansson at oracle.com>
> >>>>       >     <mailto:stefan.johansson at oracle.com
> >>>>      <mailto:stefan.johansson at oracle.com>>>> 于2019年9月12日周四 上
> >>>> 午5:34
> >>>>       >     写道：
> >>>>       >      >
> >>>>       >      >     Hi Haoyu,
> >>>>       >      >
> >>>>       >      >     I recently came across your patch and I would
> >>>> like to
> >>>>      pick up on
> >>>>       >      >     some of the things Kim mentioned in his mails. I
> >>>>      especially want
> >>>>       >      >     evaluate and investigate if this is a technique
> >>>> we can
> >>>>      use to
> >>>>       >      >     improve the other GCs as well. To start that work I
> >>>>      want to
> >>>>       >     take the
> >>>>       >      >     patch for a spin in our internal performance
> >>>> testing.
> >>>>      The patch
> >>>>       >      >     doesn’t apply clean to the latest JDK repository,
> so
> >>>>      if you could
> >>>>       >      >     provide an updated patch that would be very
> helpful.
> >>>>       >      >
> >>>>       >      >     It would also be great if you could share some more
> >>>>      information
> >>>>       >      >     around the results presented in the paper. For
> >>>> example,
> >>>> it
> >>>>       >     would be
> >>>>       >      >     good to get the full command lines for the
> different
> >>>>       >     benchmarks so
> >>>>       >      >     we can run them locally and reproduce the
> >>>>      results you’ve seen.
> >>>>       >      >
> >>>>       >      >     Thanks,
> >>>>       >      >     Stefan
> >>>>       >      >
> >>>>       >      >>     12 mars 2019 kl. 03:21 skrev Haoyu Li
> >>>>      <leihouyju at gmail.com <mailto:leihouyju at gmail.com>
> >>>>       >     <mailto:leihouyju at gmail.com <mailto:leihouyju at gmail.com>>
> >>>>       >      >>     <mailto:leihouyju at gmail.com
> >>>>      <mailto:leihouyju at gmail.com> <mailto:leihouyju at gmail.com
> >>>>      <mailto:leihouyju at gmail.com>>>>:
> >>>>       >      >>
> >>>>       >      >>     Hi Kim,
> >>>>       >      >>
> >>>>       >      >>     Thanks for reviewing and testing the patch. If
> >>>> there
> >>>>      are any
> >>>>       >      >>     failures or performance degradation relevant to
> the
> >>>>      work, please
> >>>>       >      >>     let me know and I'll be very happy to keep
> >>>> improving
> >>>> it.
> >>>>       >     Also, any
> >>>>       >      >>     suggestions about code improvements are well
> >>>> appreciated.
> >>>>       >      >>
> >>>>       >      >>     I'm not quite sure if both G1 and Shenandoah
> >>>> have the
> >>>>      similar
> >>>>       >      >>     region dependency issue, since I haven't studied
> >>>> their
> >>>> GC
> >>>>       >      >>     behaviors before. If they have, I'm also willing
> to
> >>>>      propose
> >>>>       >     a more
> >>>>       >      >>     general optimization.
> >>>>       >      >>
> >>>>       >      >>     As to the memory overhead, I believe it will be
> low
> >>>>      because this
> >>>>       >      >>     patch exploits empty regions in the young space
> >>>>      rather than
> >>>>       >      >>     off-heap memory to allocate shadow regions, and
> >>>> also
> >>>>      reuses the
> >>>>       >      >>     /_source_region/ field of each /RegionData /to
> >>>> record
> >>>> the
> >>>>       >      >>     correspongding shadow region index. We only
> >>>> introduce
> >>>>      a new
> >>>>       >      >>     integer filed /_shadow /in the RegionData class to
> >>>>      indicate the
> >>>>       >      >>     status of a region, a global /GrowableArray
> >>>>      _free_shadow/ to
> >>>>       >     store
> >>>>       >      >>     the indices of shadow regions, and a global
> >>>>      /Monitor/ to protect
> >>>>       >      >>     the array. These information might help if the
> >>>> memory
> >>>>      overhead
> >>>>       >      >>     need to be evaluated.
> >>>>       >      >>
> >>>>       >      >>     Looking forward to your insight.
> >>>>       >      >>
> >>>>       >      >>     Best Regrads,
> >>>>       >      >>     Haoyu Li,
> >>>>       >      >>     Institute of Parallel and Distributed
> >>>> Systems(IPADS),
> >>>>       >      >>     School of Software,
> >>>>       >      >>     Shanghai Jiao Tong University
> >>>>       >      >>
> >>>>       >      >>
> >>>>       >      >>     Kim Barrett <kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com>
> >>>>       >     <mailto:kim.barrett at oracle.com
> >>>> <mailto:kim.barrett at oracle.com>>
> >>>>       >      >>     <mailto:kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com>
> >>>>       >     <mailto:kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com>>>> 于2019年3月12日周二 上午6:11
> >>>> 写道：
> >>>>       >      >>
> >>>>       >      >>         > On Mar 11, 2019, at 1:45 AM, Kim Barrett
> >>>>       >      >>         <kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com> <mailto:kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com>>
> >>>>       >     <mailto:kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com> <mailto:kim.barrett at oracle.com
> >>>>      <mailto:kim.barrett at oracle.com>>>> wrote:
> >>>>       >      >>         >
> >>>>       >      >>         >> On Jan 24, 2019, at 3:58 AM, Haoyu Li
> >>>>       >     <leihouyju at gmail.com <mailto:leihouyju at gmail.com>
> >>>>      <mailto:leihouyju at gmail.com <mailto:leihouyju at gmail.com>>
> >>>>       >      >>         <mailto:leihouyju at gmail.com
> >>>>      <mailto:leihouyju at gmail.com>
> >>>>       >     <mailto:leihouyju at gmail.com <mailto:leihouyju at gmail.com
> >>>>
> >>>>      wrote:
> >>>>       >      >>         >>
> >>>>       >      >>         >> Hi Kim,
> >>>>       >      >>         >>
> >>>>       >      >>         >> I have ported my patch to OpenJDK 13
> >>>> according
> >>>>      to your
> >>>>       >      >>         instructions in your last mail, and the
> >>>> patch is
> >>>>      attached in
> >>>>       >      >>         this mail. The patch does not change much
> since
> >>>>      PSGC is
> >>>>       >     indeed
> >>>>       >      >>         pretty stable.
> >>>>       >      >>         >>
> >>>>       >      >>         >> Also, I evaluate the correctness and
> >>>>      performance of
> >>>>       >     PS full
> >>>>       >      >>         GC with benchmarks from DaCapo, SPECjvm2008,
> >>>> and
> >>>>      JOlden
> >>>>       >     suits
> >>>>       >      >>         on a machine with dual Intel Xeon E5-2618L v3
> >>>> CPUs(16
> >>>>       >     physical
> >>>>       >      >>         cores), 64G DRAM and linux kernel 4.17. The
> >>>>      evaluation
> >>>>       >     result,
> >>>>       >      >>         indicating 1.9X GC throughput improvement on
> >>>>      average, is
> >>>>       >      >>         attached, too.
> >>>>       >      >>         >>
> >>>>       >      >>         >> However, I have no idea how to further test
> >>>> this
> >>>>       >     patch for
> >>>>       >      >>         both correctness and performance. Can I please
> >>>>      get any
> >>>>       >      >>         guidance from you or some sponsor?
> >>>>       >      >>         >
> >>>>       >      >>         > Sorry I missed that you had sent an updated
> >>>>      version of the
> >>>>       >      >>         patch.
> >>>>       >      >>         >
> >>>>       >      >>         > I’ve run the full regression suite across
> >>>>      Oracle-supported
> >>>>       >      >>         platforms.  There are some
> >>>>       >      >>         > failures, but there are almost always some
> >>>>      failures in the
> >>>>       >      >>         later tiers right now.  I’ll start
> >>>>       >      >>         > looking at them tomorrow to figure out
> >>>> whether
> >>>>      any of them
> >>>>       >      >>         are relevant.
> >>>>       >      >>         >
> >>>>       >      >>         > I’m also planning to run some of our
> >>>> performance
> >>>>       >     benchmarks.
> >>>>       >      >>         >
> >>>>       >      >>         > I’ve lightly skimmed the proposed changes.
> >>>>      There might be
> >>>>       >      >>         some code improvements
> >>>>       >      >>         > to be made.
> >>>>       >      >>         >
> >>>>       >      >>         > I’m also wondering if this technique
> >>>> applies to
> >>>>      other
> >>>>       >      >>         collectors.  It seems like both G1 and
> >>>>       >      >>         > Shenandoah full gc’s might have similar
> >>>>      issues?  If so, a
> >>>>       >      >>         solution that is ParallelGC-specific
> >>>>       >      >>         > is less interesting than one that has
> broader
> >>>>       >      >>         applicability.  Though maybe this optimization
> >>>>       >      >>         > is less important for G1 and Shenandoah,
> >>>> since
> >>>> they
> >>>>       >     actively
> >>>>       >      >>         try to avoid full gc’s.
> >>>>       >      >>         >
> >>>>       >      >>         > I’m also not clear on how much additional
> >>>>      memory might be
> >>>>       >      >>         temporarily allocated by this
> >>>>       >      >>         > mechanism.
> >>>>       >      >>
> >>>>       >      >>         I’ve created a CR for this:
> >>>>       >      >> https://bugs.openjdk.java.net/browse/JDK-8220465
> >>>>       >      >>
> >>>>       >      >
> >>>>       >
> >>>>
> >>>
> >>
> >>
>