[PATCH] Exploit Empty Regions in Young Gen to Enhance PS Full GC Performance

Mon Oct 14 13:00:22 UTC 2019

Thanks for the quick update Haoyu,

This is a great improvement and I will try to find time to look into the
patch in more detail in the coming weeks.

Thanks,
Stefan
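
As a reading aid for the refactoring described in the quoted message
below, here is a minimal sketch of one filling routine that is
parameterized on the closure type and finishes by calling a virtual
complete_region(). It is not the code from the patch: Region, target(),
completed() and the string payload are hypothetical stand-ins, and only
the names MoveAndUpdateClosure, ShadowClosure, fill_region() and
complete_region() come from the thread.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a heap region: a plain byte buffer.
using Region = std::vector<char>;

// Base closure: data is written straight into the destination region.
class MoveAndUpdateClosure {
public:
  explicit MoveAndUpdateClosure(Region& dest) : _dest(dest) {}
  virtual ~MoveAndUpdateClosure() = default;

  // Where copied data is written; the shadow variant redirects this.
  virtual Region& target() { return _dest; }

  // Work done after the filling; the normal case only records completion.
  virtual void complete_region() { _completed = true; }

  bool completed() const { return _completed; }

protected:
  Region& _dest;
  bool _completed = false;
};

// Shadow variant: fills a scratch region first, then copies it back.
class ShadowClosure : public MoveAndUpdateClosure {
public:
  ShadowClosure(Region& dest, Region& shadow)
      : MoveAndUpdateClosure(dest), _shadow(shadow) {}

  Region& target() override { return _shadow; }

  void complete_region() override {
    assert(&_dest != &_shadow);  // a shadow must be a different region
    _dest = _shadow;             // copy the shadow region back
    _completed = true;
  }

private:
  Region& _shadow;
};

// One filling routine shared by both cases; the closure type selects
// whether a normal region or a shadow region is filled.
template <typename ClosureT>
void fill_region(ClosureT& closure, const std::string& live_data) {
  Region& out = closure.target();
  out.assign(live_data.begin(), live_data.end());
  closure.complete_region();
}
```

Calling fill_region() with a MoveAndUpdateClosure writes directly into
the destination, while calling it with a ShadowClosure writes into the
scratch region and copies it back at the end, so the filling logic
itself exists only once.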

On 2019-10-11 14:49, Haoyu Li wrote:
> Hi Stefan,
> 
> Thanks for your suggestion! It is indeed redundant for
> PSParallelCompact::fill_shadow_region() to copy most of its code from
> PSParallelCompact::fill_region(), so I've refactored these two
> functions to share as much code as possible. The updated patch is
> attached.
> 
> Specifically, the closure that moves objects in
> PSParallelCompact::fill_region() is now declared as a template
> parameter that is either MoveAndUpdateClosure or ShadowClosure. So by
> controlling the type of closure when invoking the function, we can
> decide whether to fill a normal region or a shadow one. Thus, almost
> all code in PSParallelCompact::fill_region() can be reused.
> 
> Besides, a virtual function named complete_region() is added to both
> closures to do some work after the filling, such as setting states and
> copying the shadow region back.
> 
> Thanks again for reviewing the patch, looking forward to your insights
> and suggestions!
> 
> Best Regards,
> Haoyu Li
> 
> 2019-10-10 21:50 GMT+08:00, Stefan Johansson <stefan.johansson at oracle.com>:
>> Thanks for the clarification =)
>>
>> Moving on to the next part, the code in the patch. So this won't be a
>> full review of the patch but just an initial comment that I would like
>> to be addressed first.
>>
>> The new function PSParallelCompact::fill_shadow_region() is more or less
>> a copy of PSParallelCompact::fill_region() and I understand that from a
>> proof of concept point of view it was the easy (and right) way to do it.
>> I would prefer if the code could be refactored so that fill_region() and
>> fill_shadow_region() share more code. There might be reasons that I've
>> missed that prevent it, but we should at least explore how much code
>> can be shared.
>>
>> Thanks,
>> Stefan
>>
>> On 2019-10-10 15:10, Haoyu Li wrote:
>>> Hi Stefan,
>>>
>>> Thanks for your quick response! As to your concern about the OCA, I am
>>> the sole author of the patch, and the situation is as the agreement
>>> states.
>>> Best Regards,
>>> Haoyu Li,
>>>
>>>
>>> Stefan Johansson <stefan.johansson at oracle.com> wrote on
>>> Thu, Oct 10, 2019 at 8:37 PM:
>>>
>>>      Hi,
>>>
>>>      On 2019-10-10 13:06, Haoyu Li wrote:
>>>       > Hi Stefan,
>>>       >
>>>       > Thanks for your testing! One possible reason for the regressions
>>>       > in simple tests is that the region dependencies may not be heavy
>>>       > enough. Because the locality of shadow regions is lower than that
>>>       > of heap regions, writing to shadow regions will be slower than to
>>>       > normal regions, and this is part of the reason why I reuse shadow
>>>       > regions. Therefore, if only a few shadow regions are created and
>>>       > not reused, the overhead may not be amortized.
>>>
>>>      I guess it is something like this. I thought that for "easy" heaps
>>>      the shadow regions won't be used at all, and should therefore not
>>>      really cost anything.
>>>
>>>       >
>>>       > As to the OCA, it is the case that I'm the only person signing the
>>>       > agreement. Please let me know if you have any further questions.
>>>       > Thanks again!
>>>
>>>      Ok, so you are the sole author of the patch. The important part, as
>>>      the agreement states, is:
>>>      "no other person or entity, including my employer, has or will have
>>>      rights with respect my contributions"
>>>
>>>      Is that the case?
>>>
>>>      Thanks,
>>>      Stefan
>>>
>>>       >
>>>       > Best Regards,
>>>       > Haoyu Li
>>>       >
>>>       > Stefan Johansson <stefan.johansson at oracle.com> wrote on
>>>       > Tue, Oct 8, 2019 at 6:49 PM:
>>>       >
>>>       >     Hi Haoyu,
>>>       >
>>>       >     I've done some more testing and I haven't seen any issues with
>>>       >     the patch so far and the performance looks promising in most
>>>       >     cases. For simple tests I've seen some regressions, but I'm not
>>>       >     really sure why. Will do some more digging.
>>>       >
>>>       >     To move forward with this, the first thing we need to do is
>>>       >     make sure that you being covered by the Oracle Contributor
>>>       >     Agreement is enough. From what we can see it is only you as an
>>>       >     individual that has signed the OCA and in that case it is
>>>       >     important that this statement from the OCA is fulfilled: "no
>>>       >     other person or entity, including my employer, has or will have
>>>       >     rights with respect my contributions"
>>>       >
>>>       >     Is this the case for this contribution or should we have the
>>>       >     university sign the OCA as well? For more information regarding
>>>       >     the OCA please refer to:
>>>       > https://www.oracle.com/technetwork/oca-faq-405384.pdf
>>>       >
>>>       >     Thanks,
>>>       >     Stefan
>>>       >
>>>       >     On 2019-09-16 16:02, Haoyu Li wrote:
>>>       >      > FYI, the evaluation results on OpenJDK 14 are plotted in the
>>>       >      > attachment. I compute the full GC throughput by dividing the
>>>       >      > heap size before full GC by the GC pause time, and the
>>>       >      > results are arithmetic mean values of ten runs after a
>>>       >      > warm-up run. The evaluation is conducted on a machine with
>>>       >      > dual Intel® Xeon™ E5-2618L v3 CPUs (2 sockets, 16 physical
>>>       >      > cores with SMT enabled) and 64G DRAM.
>>>       >      >
>>>       >      > Best Regards,
>>>       >      > Haoyu Li,
>>>       >      > Institute of Parallel and Distributed Systems(IPADS),
>>>       >      > School of Software,
>>>       >      > Shanghai Jiao Tong University
>>>       >      >
>>>       >      >
>>>       >      > Stefan Johansson <stefan.johansson at oracle.com> wrote on
>>>       >      > Thu, Sep 12, 2019 at 5:34 AM:
>>>       >      >
>>>       >      >     Hi Haoyu,
>>>       >      >
>>>       >      >     I recently came across your patch and I would like to pick
>>>       >      >     up on some of the things Kim mentioned in his mails. I
>>>       >      >     especially want to evaluate and investigate if this is a
>>>       >      >     technique we can use to improve the other GCs as well. To
>>>       >      >     start that work I want to take the patch for a spin in our
>>>       >      >     internal performance testing. The patch doesn’t apply
>>>       >      >     cleanly to the latest JDK repository, so if you could
>>>       >      >     provide an updated patch that would be very helpful.
>>>       >      >
>>>       >      >     It would also be great if you could share some more
>>>       >      >     information around the results presented in the paper. For
>>>       >      >     example, it would be good to get the full command lines for
>>>       >      >     the different benchmarks so we can run them locally and
>>>       >      >     reproduce the results you’ve seen.
>>>       >      >
>>>       >      >     Thanks,
>>>       >      >     Stefan
>>>       >      >
>>>       >      >>     On 12 March 2019 at 03:21, Haoyu Li
>>>       >      >>     <leihouyju at gmail.com> wrote:
>>>       >      >>
>>>       >      >>     Hi Kim,
>>>       >      >>
>>>       >      >>     Thanks for reviewing and testing the patch. If there are
>>>       >      >>     any failures or performance degradation relevant to the
>>>       >      >>     work, please let me know and I'll be very happy to keep
>>>       >      >>     improving it. Also, any suggestions about code
>>>       >      >>     improvements are much appreciated.
>>>       >      >>
>>>       >      >>     I'm not quite sure whether G1 and Shenandoah have a
>>>       >      >>     similar region dependency issue, since I haven't studied
>>>       >      >>     their GC behaviors before. If they do, I'm also willing to
>>>       >      >>     propose a more general optimization.
>>>       >      >>
>>>       >      >>     As to the memory overhead, I believe it will be low
>>>       >      >>     because this patch exploits empty regions in the young
>>>       >      >>     space rather than off-heap memory to allocate shadow
>>>       >      >>     regions, and also reuses the /_source_region/ field of
>>>       >      >>     each /RegionData/ to record the corresponding shadow
>>>       >      >>     region index. We only introduce a new integer field
>>>       >      >>     /_shadow/ in the RegionData class to indicate the status
>>>       >      >>     of a region, a global /GrowableArray _free_shadow/ to
>>>       >      >>     store the indices of shadow regions, and a global
>>>       >      >>     /Monitor/ to protect the array. This information might
>>>       >      >>     help if the memory overhead needs to be evaluated.
>>>       >      >>
>>>       >      >>     Looking forward to your insight.
>>>       >      >>
>>>       >      >>     Best Regards,
>>>       >      >>     Haoyu Li,
>>>       >      >>     Institute of Parallel and Distributed Systems(IPADS),
>>>       >      >>     School of Software,
>>>       >      >>     Shanghai Jiao Tong University
>>>       >      >>
>>>       >      >>
>>>       >      >>     Kim Barrett <kim.barrett at oracle.com> wrote on
>>>       >      >>     Tue, Mar 12, 2019 at 6:11 AM:
>>>       >      >>
>>>       >      >>         > On Mar 11, 2019, at 1:45 AM, Kim Barrett
>>>       >      >>         > <kim.barrett at oracle.com> wrote:
>>>       >      >>         >
>>>       >      >>         >> On Jan 24, 2019, at 3:58 AM, Haoyu Li
>>>       >      >>         >> <leihouyju at gmail.com> wrote:
>>>       >      >>         >>
>>>       >      >>         >> Hi Kim,
>>>       >      >>         >>
>>>       >      >>         >> I have ported my patch to OpenJDK 13 according to your
>>>       >      >>         >> instructions in your last mail, and the patch is
>>>       >      >>         >> attached to this mail. The patch does not change much
>>>       >      >>         >> since PSGC is indeed pretty stable.
>>>       >      >>         >>
>>>       >      >>         >> Also, I evaluated the correctness and performance of PS
>>>       >      >>         >> full GC with benchmarks from the DaCapo, SPECjvm2008,
>>>       >      >>         >> and JOlden suites on a machine with dual Intel Xeon
>>>       >      >>         >> E5-2618L v3 CPUs (16 physical cores), 64G DRAM and
>>>       >      >>         >> Linux kernel 4.17. The evaluation result, indicating a
>>>       >      >>         >> 1.9X GC throughput improvement on average, is attached,
>>>       >      >>         >> too.
>>>       >      >>         >>
>>>       >      >>         >> However, I have no idea how to further test this patch
>>>       >      >>         >> for both correctness and performance. Can I please get
>>>       >      >>         >> any guidance from you or some sponsor?
>>>       >      >>         >
>>>       >      >>         > Sorry I missed that you had sent an updated version of
>>>       >      >>         > the patch.
>>>       >      >>         >
>>>       >      >>         > I’ve run the full regression suite across
>>>       >      >>         > Oracle-supported platforms.  There are some failures,
>>>       >      >>         > but there are almost always some failures in the later
>>>       >      >>         > tiers right now.  I’ll start looking at them tomorrow to
>>>       >      >>         > figure out whether any of them are relevant.
>>>       >      >>         >
>>>       >      >>         > I’m also planning to run some of our performance
>>>       >      >>         > benchmarks.
>>>       >      >>         >
>>>       >      >>         > I’ve lightly skimmed the proposed changes.  There might
>>>       >      >>         > be some code improvements to be made.
>>>       >      >>         >
>>>       >      >>         > I’m also wondering if this technique applies to other
>>>       >      >>         > collectors.  It seems like both G1 and Shenandoah full
>>>       >      >>         > gc’s might have similar issues?  If so, a solution that
>>>       >      >>         > is ParallelGC-specific is less interesting than one that
>>>       >      >>         > has broader applicability.  Though maybe this
>>>       >      >>         > optimization is less important for G1 and Shenandoah,
>>>       >      >>         > since they actively try to avoid full gc’s.
>>>       >      >>         >
>>>       >      >>         > I’m also not clear on how much additional memory might
>>>       >      >>         > be temporarily allocated by this mechanism.
>>>       >      >>
>>>       >      >>         I’ve created a CR for this:
>>>       >      >> https://bugs.openjdk.java.net/browse/JDK-8220465
>>>       >      >>
>>>       >      >
>>>       >
>>>
>>
> 
>
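
The bookkeeping described in the March 2019 message quoted above (a
global array of free shadow-region indices protected by a monitor) can
be pictured with the following minimal sketch. It is an illustration
under assumptions, not the patch itself: ShadowRegionPool, acquire()
and release() are hypothetical names, and std::vector and std::mutex
stand in for HotSpot's GrowableArray and Monitor.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical pool of free shadow-region indices. A mutex guards the
// list so that several GC worker threads can share it safely.
class ShadowRegionPool {
public:
  // Make a region index (for example an empty young-gen region)
  // available for use as a shadow region, either initially or again
  // after its contents have been copied back.
  void release(std::size_t region_index) {
    std::lock_guard<std::mutex> guard(_lock);
    _free_shadow.push_back(region_index);
  }

  // Try to take a free shadow region. Returns false when none is
  // available; on success the index is stored in 'out'.
  bool acquire(std::size_t& out) {
    std::lock_guard<std::mutex> guard(_lock);
    if (_free_shadow.empty()) {
      return false;
    }
    out = _free_shadow.back();
    _free_shadow.pop_back();
    return true;
  }

private:
  std::mutex _lock;
  std::vector<std::size_t> _free_shadow;
};
```

Because a released index goes back onto the list, the same shadow
region can be handed out repeatedly, which matches the reuse that the
thread mentions as a way to amortize the cost of writing to shadow
regions. The extra state is limited to the index list and its lock.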