[PATCH] Exploit Empty Regions in Young Gen to Enhance PS Full GC Performance

Tue Oct 22 13:41:46 UTC 2019

Hi Haoyu,

I've reviewed the patch now and have some comments and questions.

To simplify the review and have a common base to look at I've created a 
webrev at:
http://cr.openjdk.java.net/~sjohanss/8220465/00/

One general note first, most of the new code uses four space 
indentation, in hotspot the standard is two spaces, please change this. 
Below are some file by file comments.

src/hotspot/share/gc/parallel/psCompactionManager.cpp
---
   53 GrowableArray<size_t >* ParCompactionManager::_free_shadow = new 
(ResourceObj::C_HEAP, mtInternal) GrowableArray<size_t >(10, true);
   54 Monitor*                ParCompactionManager::_monitor = NULL;

Set _free_shadow to NULL here like the other statics and then create the 
GrowableArray in initialize(). I also think _shadow_region_array or 
something like that would be a better name and the monitor should also 
be named something that signals that it is used for this array.
---
   70   if (_monitor == NULL) {
   71       _monitor = new Monitor(Mutex::barrier, "CompactionManager 
monitor",
   72                              Mutex::_allow_vm_block_flag, 
Monitor::_safepoint_check_never);
   73   }

Instead of doing the monitor creation here having to check for NULL, do 
it in initialize() below together with the array creation.
---

src/hotspot/share/gc/parallel/psParallelCompact.cpp
---
2974       if (cur->push()) {

Correct me if I'm wrong, if this call to push() returns true it means 
that nobody else has "stolen" it (used a shadow region to prepare it) 
and we mark it as pushed. But when pushed in this code path this is the 
end state for this RegionData? If this is the case I think it would be 
easier to understand the code if we added another function and state for 
when we "steal" it. Haven't thought very much about the names but I 
think you understand what I want to achieve:
Normal path:
UNUSED -> push() -> NORMAL
Steal path:
UNUSED -> steal() -> STOLEN -> fill() -> FILLED -> copy() -> SHADOW

We could then also assert in set_completed() that the state is either 
NORMAL or SHADOW (or if they have a shared end state DONE). As I said 
the names can be improved (both for the states and the functions) but I 
think we should have names and not just numbers.
---

3060 template <class T>
3061 void PSParallelCompact::fill_region(ParCompactionManager* cm, 
size_t region_idx, size_t shadow, size_t offset)

As I told you this was a big improvement from the first patch, but I 
think there is room for even more improvements around the way we pass in 
ignored parameters to MoveAndUpdateClosure. Explaining my idea in text 
is harder than code, so I created a patch, what do you think about this?
http://cr.openjdk.java.net/~sjohanss/8220465/00-alt/

This alternative is based on 00 and does not take my other comments into 
consideration. So it might have to be altered a bit if you address some 
of my other comments/questions.
---

3196 void PSParallelCompact::copy_back(HeapWord *region_addr, HeapWord 
*shadow_addr) {

I think the paramenter should change place, so that it corresponds with 
the copy below.
---

3200 bool PSParallelCompact::steal_shadow_region(ParCompactionManager* 
cm, size_t &region_idx) {
3201     size_t& record = cm->shadow_record();

Did you consider to just let shadow_record() be a simple getter instead 
of getting a reference and then have a next_shadow_record() which 
advances it by active_workers?
---

3236 void PSParallelCompact::initialize_steal_record(uint which) {

I'm having a hard time understanding the details here, or I get that all 
threads should have a separate shadow record, but I'm not sure why it is 
not enough to just do:
size_t record = _summary_data.addr_to_region_idx(
   _space_info[old_space_id].dense_prefix());
cm->set_shadow_record(record + which);

As you can see I'm also suggesting adding a setter for shadow_record.
---

3434 ParMarkBitMapClosure::IterationStatus
3435 ShadowClosure::do_addr(HeapWord* addr, size_t words) {
3436     HeapWord* shadow_destination = destination() + _offset;

Using an offset instead of a given address feels a bit backwards, did 
you consider letting the closure keep and update a _shadow_destination 
instead? Or would it even be possible to just set destination to be the 
shadow region address? In that case it should be possible to just use 
the do_addr and other functions from the MoveAndUpdateClosure.

I see from looking at this particular function that there is one assert 
that would have to change:
3408 
assert(PSParallelCompact::summary_data().calc_new_pointer(source(), 
compaction_manager()) ==
3409          destination(), "wrong destination");

This should be easily fixed by adding a virtual function 
check_destination, that has a special implementation for the ShadowClosure.
---

src/hotspot/share/gc/parallel/psParallelCompact.hpp
---
  333     // Preempt the region to avoid double processes
  334     inline bool push();
  335     // Mark the region as filled and ready to be copied back
  336     inline bool fill();
  337     // Preempt the region to copy the shadow region content back
  338     inline bool copy();

As mentioned, I think there might be better names for those functions 
and the comments. Maybe adding a prefix would make the code more self 
explaining. try_push(), mark_filled(), try_copy() and the new try_steal().
---

Thanks again for providing this patch, I look forward to see an updated 
version.

Cheers,
Stefan

On 2019-10-14 15:00, Stefan Johansson wrote:
> Thanks for the quick update Haoyu,
> 
> This is a great improvement and I will try to find time to look into the 
> patch in more detail the coming weeks.
> 
> Thanks,
> Stefan
> 
> On 2019-10-11 14:49, Haoyu Li wrote:
>> Hi Stefan,
>>
>> Thanks for your suggestion! It is very redundant that
>> PSParallelCompact::fill_shadow_region() copies most code from
>> PSParallelCompact::fill_region(), and therefore I've refactored these
>> two functions to share code as many as possible. And the attachment is
>> the updated patch.
>>
>> Specifically, the closure, which moves objects, in
>> PSParallelCompact::fill_region() is now declared as a template of
>> either MoveAndUpdateClosure or ShadowClosure. So by controlling the
>> type of closure when invoking the function, we can decide whether to
>> fill a normal region or a shadow one. Thus, almost all code in
>> PSParallelCompact::fill_region() can be reused.
>>
>> Besides, a virtual function named complete_region() is added in both
>> closures to do some work after the filling, such setting states and
>> copying the shadow region back.
>>
>> Thanks again for reviewing the patch, looking forward to your insights
>> and suggestions!
>>
>> Best Regards,
>> Haoyu Li
>>
>> 2019-10-10 21:50 GMT+08:00, Stefan Johansson 
>> <stefan.johansson at oracle.com>:
>>> Thanks for the clarification =)
>>>
>>> Moving on to the next part, the code in the patch. So this won't be a
>>> full review of the patch but just an initial comment that I would like
>>> to be addressed first.
>>>
>>> The new function PSParallelCompact::fill_shadow_region() is more or less
>>> a copy of PSParallelCompact::fill_region() and I understand that from a
>>> proof of concept point of view it was the easy (and right) way to do it.
>>> I would prefer if the code could be refactored so that fill_region() and
>>> fill_shadow_region() share more code. There might be reasons that I've
>>> missed, that prevents it, but we should at least explore how much code
>>> can be shared.
>>>
>>> Thanks,
>>> Stefan
>>>
>>> On 2019-10-10 15:10, Haoyu Li wrote:
>>>> Hi Stefan,
>>>>
>>>> Thanks for your quick response! As to your concern about the OCA, I am
>>>> the sole author of the patch. And it is the case as what the agreement
>>>> states.
>>>> Best Regrads,
>>>> Haoyu Li,
>>>>
>>>>
>>>> Stefan Johansson <stefan.johansson at oracle.com
>>>> <mailto:stefan.johansson at oracle.com>> 于2019年10月10日周四 下午8:37 
>>>> 写道：
>>>>
>>>>      Hi,
>>>>
>>>>      On 2019-10-10 13:06, Haoyu Li wrote:
>>>>       > Hi Stefan,
>>>>       >
>>>>       > Thanks for your testing! One possible reason for the 
>>>> regressions
>>>> in
>>>>       > simple tests is that the region dependencies maybe not heavy
>>>> enough.
>>>>       > Because the locality of shadow regions is lower than that of 
>>>> heap
>>>>       > regions, writing to shadow regions will be slower than to 
>>>> normal
>>>>       > regions, and this is a part of the reason why I reuse shadow
>>>>      regions.
>>>>       > Therefore, if only a few shadow regions are created and not
>>>>      reused, the
>>>>       > overhead may not be amortized.
>>>>
>>>>      I guess it is something like this. I thought that for "easy" heaps
>>>> the
>>>>      shadow regions won't be used at all, and should therefor not 
>>>> really
>>>>      cost
>>>>      anything.
>>>>
>>>>       >
>>>>       > As to the OCA, it is the case that I'm the only person 
>>>> signing the
>>>>       > agreement. Please let me know if you have any further 
>>>> questions.
>>>>      Thanks
>>>>       > again!
>>>>
>>>>      Ok, so you are the sole author of the patch. The important 
>>>> part, as
>>>> the
>>>>      agreement states, is:
>>>>      "no other person or entity, including my employer, has or will 
>>>> have
>>>>      rights with respect my contributions"
>>>>
>>>>      Is that the case?
>>>>
>>>>      Thanks,
>>>>      Stefan
>>>>
>>>>       >
>>>>       > Best Regrads,
>>>>       > Haoyu Li
>>>>       >
>>>>       > Stefan Johansson <stefan.johansson at oracle.com
>>>>      <mailto:stefan.johansson at oracle.com>
>>>>       > <mailto:stefan.johansson at oracle.com
>>>>      <mailto:stefan.johansson at oracle.com>>> 于2019年10月8日周二 下午 
>>>> 6:49
>>>>      写道：
>>>>       >
>>>>       >     Hi Haoyu,
>>>>       >
>>>>       >     I've done some more testing and I haven't seen any issues
>>>>      with the
>>>>       >     patch
>>>>       >     so far and the performance looks promising in most 
>>>> cases. For
>>>>      simple
>>>>       >     tests I've seen some regressions, but I'm not really sure
>>>>      why. Will do
>>>>       >     some more digging.
>>>>       >
>>>>       >     To move forward with this the first thing we need to do is
>>>>      making sure
>>>>       >     that you being covered by the Oracle Contributor 
>>>> Agreement is
>>>>      enough.
>>>>       >       From what we can see it is only you as an individual that
>>>>      has signed
>>>>       >     the OCA and in that case it is important that this 
>>>> statement
>>>>      from the
>>>>       >     OCA is fulfilled: "no other person or entity, including my
>>>>      employer,
>>>>       >     has
>>>>       >     or will have rights with respect my contributions"
>>>>       >
>>>>       >     Is this the case for this contribution or should we have 
>>>> the
>>>>      university
>>>>       >     sign the OCA as well? For more information regarding the 
>>>> OCA
>>>>      please
>>>>       >     refer to:
>>>>       > https://www.oracle.com/technetwork/oca-faq-405384.pdf
>>>>       >
>>>>       >     Thanks,
>>>>       >     Stefan
>>>>       >
>>>>       >     On 2019-09-16 16:02, Haoyu Li wrote:
>>>>       >      > FYI, the evaluation results on OpenJDK 14 are plotted in
>>>> the
>>>>       >     attachment.
>>>>       >      > I compute the full GC throughput by dividing the heap 
>>>> size
>>>>      before
>>>>       >     full
>>>>       >      > GC by the GC pause time, and the results are arithmetic
>>>> mean
>>>>       >     values of
>>>>       >      > ten runs after a warm-up run. The evaluation is 
>>>> conducted on
>>>> a
>>>>       >     machine
>>>>       >      > with dual Intel ®XeonTM E5-2618L v3 CPUs (2 sockets, 16
>>>>      physical
>>>>       >     cores
>>>>       >      > with SMT enabled) and 64G DRAM.
>>>>       >      >
>>>>       >      > Best Regrads,
>>>>       >      > Haoyu Li,
>>>>       >      > Institute of Parallel and Distributed Systems(IPADS),
>>>>       >      > School of Software,
>>>>       >      > Shanghai Jiao Tong University
>>>>       >      >
>>>>       >      >
>>>>       >      > Stefan Johansson <stefan.johansson at oracle.com
>>>>      <mailto:stefan.johansson at oracle.com>
>>>>       >     <mailto:stefan.johansson at oracle.com
>>>>      <mailto:stefan.johansson at oracle.com>>
>>>>       >      > <mailto:stefan.johansson at oracle.com
>>>>      <mailto:stefan.johansson at oracle.com>
>>>>       >     <mailto:stefan.johansson at oracle.com
>>>>      <mailto:stefan.johansson at oracle.com>>>> 于2019年9月12日周四 上 
>>>> 午5:34
>>>>       >     写道：
>>>>       >      >
>>>>       >      >     Hi Haoyu,
>>>>       >      >
>>>>       >      >     I recently came across your patch and I would 
>>>> like to
>>>>      pick up on
>>>>       >      >     some of the things Kim mentioned in his mails. I
>>>>      especially want
>>>>       >      >     evaluate and investigate if this is a technique 
>>>> we can
>>>>      use to
>>>>       >      >     improve the other GCs as well. To start that work I
>>>>      want to
>>>>       >     take the
>>>>       >      >     patch for a spin in our internal performance 
>>>> testing.
>>>>      The patch
>>>>       >      >     doesn’t apply clean to the latest JDK repository, so
>>>>      if you could
>>>>       >      >     provide an updated patch that would be very helpful.
>>>>       >      >
>>>>       >      >     It would also be great if you could share some more
>>>>      information
>>>>       >      >     around the results presented in the paper. For 
>>>> example,
>>>> it
>>>>       >     would be
>>>>       >      >     good to get the full command lines for the different
>>>>       >     benchmarks so
>>>>       >      >     we can run them locally and reproduce the
>>>>      results you’ve seen.
>>>>       >      >
>>>>       >      >     Thanks,
>>>>       >      >     Stefan
>>>>       >      >
>>>>       >      >>     12 mars 2019 kl. 03:21 skrev Haoyu Li
>>>>      <leihouyju at gmail.com <mailto:leihouyju at gmail.com>
>>>>       >     <mailto:leihouyju at gmail.com <mailto:leihouyju at gmail.com>>
>>>>       >      >>     <mailto:leihouyju at gmail.com
>>>>      <mailto:leihouyju at gmail.com> <mailto:leihouyju at gmail.com
>>>>      <mailto:leihouyju at gmail.com>>>>:
>>>>       >      >>
>>>>       >      >>     Hi Kim,
>>>>       >      >>
>>>>       >      >>     Thanks for reviewing and testing the patch. If 
>>>> there
>>>>      are any
>>>>       >      >>     failures or performance degradation relevant to the
>>>>      work, please
>>>>       >      >>     let me know and I'll be very happy to keep 
>>>> improving
>>>> it.
>>>>       >     Also, any
>>>>       >      >>     suggestions about code improvements are well
>>>> appreciated.
>>>>       >      >>
>>>>       >      >>     I'm not quite sure if both G1 and Shenandoah 
>>>> have the
>>>>      similar
>>>>       >      >>     region dependency issue, since I haven't studied 
>>>> their
>>>> GC
>>>>       >      >>     behaviors before. If they have, I'm also willing to
>>>>      propose
>>>>       >     a more
>>>>       >      >>     general optimization.
>>>>       >      >>
>>>>       >      >>     As to the memory overhead, I believe it will be low
>>>>      because this
>>>>       >      >>     patch exploits empty regions in the young space
>>>>      rather than
>>>>       >      >>     off-heap memory to allocate shadow regions, and 
>>>> also
>>>>      reuses the
>>>>       >      >>     /_source_region/ field of each /RegionData /to 
>>>> record
>>>> the
>>>>       >      >>     correspongding shadow region index. We only 
>>>> introduce
>>>>      a new
>>>>       >      >>     integer filed /_shadow /in the RegionData class to
>>>>      indicate the
>>>>       >      >>     status of a region, a global /GrowableArray
>>>>      _free_shadow/ to
>>>>       >     store
>>>>       >      >>     the indices of shadow regions, and a global
>>>>      /Monitor/ to protect
>>>>       >      >>     the array. These information might help if the 
>>>> memory
>>>>      overhead
>>>>       >      >>     need to be evaluated.
>>>>       >      >>
>>>>       >      >>     Looking forward to your insight.
>>>>       >      >>
>>>>       >      >>     Best Regrads,
>>>>       >      >>     Haoyu Li,
>>>>       >      >>     Institute of Parallel and Distributed 
>>>> Systems(IPADS),
>>>>       >      >>     School of Software,
>>>>       >      >>     Shanghai Jiao Tong University
>>>>       >      >>
>>>>       >      >>
>>>>       >      >>     Kim Barrett <kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com>
>>>>       >     <mailto:kim.barrett at oracle.com
>>>> <mailto:kim.barrett at oracle.com>>
>>>>       >      >>     <mailto:kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com>
>>>>       >     <mailto:kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com>>>> 于2019年3月12日周二 上午6:11 
>>>> 写道：
>>>>       >      >>
>>>>       >      >>         > On Mar 11, 2019, at 1:45 AM, Kim Barrett
>>>>       >      >>         <kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com> <mailto:kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com>>
>>>>       >     <mailto:kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com> <mailto:kim.barrett at oracle.com
>>>>      <mailto:kim.barrett at oracle.com>>>> wrote:
>>>>       >      >>         >
>>>>       >      >>         >> On Jan 24, 2019, at 3:58 AM, Haoyu Li
>>>>       >     <leihouyju at gmail.com <mailto:leihouyju at gmail.com>
>>>>      <mailto:leihouyju at gmail.com <mailto:leihouyju at gmail.com>>
>>>>       >      >>         <mailto:leihouyju at gmail.com
>>>>      <mailto:leihouyju at gmail.com>
>>>>       >     <mailto:leihouyju at gmail.com <mailto:leihouyju at gmail.com>>>>
>>>>      wrote:
>>>>       >      >>         >>
>>>>       >      >>         >> Hi Kim,
>>>>       >      >>         >>
>>>>       >      >>         >> I have ported my patch to OpenJDK 13 
>>>> according
>>>>      to your
>>>>       >      >>         instructions in your last mail, and the 
>>>> patch is
>>>>      attached in
>>>>       >      >>         this mail. The patch does not change much since
>>>>      PSGC is
>>>>       >     indeed
>>>>       >      >>         pretty stable.
>>>>       >      >>         >>
>>>>       >      >>         >> Also, I evaluate the correctness and
>>>>      performance of
>>>>       >     PS full
>>>>       >      >>         GC with benchmarks from DaCapo, SPECjvm2008, 
>>>> and
>>>>      JOlden
>>>>       >     suits
>>>>       >      >>         on a machine with dual Intel Xeon E5-2618L v3
>>>> CPUs(16
>>>>       >     physical
>>>>       >      >>         cores), 64G DRAM and linux kernel 4.17. The
>>>>      evaluation
>>>>       >     result,
>>>>       >      >>         indicating 1.9X GC throughput improvement on
>>>>      average, is
>>>>       >      >>         attached, too.
>>>>       >      >>         >>
>>>>       >      >>         >> However, I have no idea how to further test
>>>> this
>>>>       >     patch for
>>>>       >      >>         both correctness and performance. Can I please
>>>>      get any
>>>>       >      >>         guidance from you or some sponsor?
>>>>       >      >>         >
>>>>       >      >>         > Sorry I missed that you had sent an updated
>>>>      version of the
>>>>       >      >>         patch.
>>>>       >      >>         >
>>>>       >      >>         > I’ve run the full regression suite across
>>>>      Oracle-supported
>>>>       >      >>         platforms.  There are some
>>>>       >      >>         > failures, but there are almost always some
>>>>      failures in the
>>>>       >      >>         later tiers right now.  I’ll start
>>>>       >      >>         > looking at them tomorrow to figure out 
>>>> whether
>>>>      any of them
>>>>       >      >>         are relevant.
>>>>       >      >>         >
>>>>       >      >>         > I’m also planning to run some of our 
>>>> performance
>>>>       >     benchmarks.
>>>>       >      >>         >
>>>>       >      >>         > I’ve lightly skimmed the proposed changes.
>>>>      There might be
>>>>       >      >>         some code improvements
>>>>       >      >>         > to be made.
>>>>       >      >>         >
>>>>       >      >>         > I’m also wondering if this technique 
>>>> applies to
>>>>      other
>>>>       >      >>         collectors.  It seems like both G1 and
>>>>       >      >>         > Shenandoah full gc’s might have similar
>>>>      issues?  If so, a
>>>>       >      >>         solution that is ParallelGC-specific
>>>>       >      >>         > is less interesting than one that has broader
>>>>       >      >>         applicability.  Though maybe this optimization
>>>>       >      >>         > is less important for G1 and Shenandoah, 
>>>> since
>>>> they
>>>>       >     actively
>>>>       >      >>         try to avoid full gc’s.
>>>>       >      >>         >
>>>>       >      >>         > I’m also not clear on how much additional
>>>>      memory might be
>>>>       >      >>         temporarily allocated by this
>>>>       >      >>         > mechanism.
>>>>       >      >>
>>>>       >      >>         I’ve created a CR for this:
>>>>       >      >> https://bugs.openjdk.java.net/browse/JDK-8220465
>>>>       >      >>
>>>>       >      >
>>>>       >
>>>>
>>>
>>
>>