[PATCH] Exploit Empty Regions in Young Gen to Enhance PS Full GC Performance

Thu Oct 10 13:50:56 UTC 2019

Thanks for the clarification =)

Moving on to the next part, the code in the patch. So this won't be a 
full review of the patch but just an initial comment that I would like 
to be addressed first.

The new function PSParallelCompact::fill_shadow_region() is more or less 
a copy of PSParallelCompact::fill_region() and I understand that from a 
proof of concept point of view it was the easy (and right) way to do it. 
I would prefer if the code could be refactored so that fill_region() and 
fill_shadow_region() share more code. There might be reasons that I've 
missed that prevent it, but we should at least explore how much code 
can be shared.
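
One possible shape for such sharing, as a toy sketch rather than HotSpot 
code: both paths call a single copy routine and differ only in where the 
destination address comes from. ToyHeap, kRegionWords, fill_region_impl() 
and copy_back() are hypothetical names introduced for illustration, and 
the signatures here are much simpler than those of the real 
PSParallelCompact functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model, not HotSpot code: the "heap" is a vector of words split into
// fixed-size regions. Every name below is hypothetical.
constexpr std::size_t kRegionWords = 4;

struct ToyHeap {
  std::vector<int> words;  // backing storage for all regions
  explicit ToyHeap(std::size_t regions) : words(regions * kRegionWords, 0) {}
  int* region_start(std::size_t idx) { return words.data() + idx * kRegionWords; }
};

// The single shared copy loop. The normal and shadow paths differ only in
// the destination pointer they pass in.
inline void fill_region_impl(ToyHeap& heap, std::size_t src_idx, int* dest) {
  const int* src = heap.region_start(src_idx);
  for (std::size_t i = 0; i < kRegionWords; ++i) {
    dest[i] = src[i];
  }
}

// Normal path: write straight into the destination region.
inline void fill_region(ToyHeap& heap, std::size_t src_idx, std::size_t dest_idx) {
  assert(src_idx != dest_idx);
  fill_region_impl(heap, src_idx, heap.region_start(dest_idx));
}

// Shadow path: write into a shadow region first.
inline void fill_shadow_region(ToyHeap& heap, std::size_t src_idx, std::size_t shadow_idx) {
  assert(src_idx != shadow_idx);
  fill_region_impl(heap, src_idx, heap.region_start(shadow_idx));
}

// Later, once the real destination is free, copy the shadow contents back.
inline void copy_back(ToyHeap& heap, std::size_t shadow_idx, std::size_t dest_idx) {
  assert(shadow_idx != dest_idx);
  fill_region_impl(heap, shadow_idx, heap.region_start(dest_idx));
}
```

In this sketch the two entry points shrink to thin wrappers, so any fix 
to the copy loop applies to both paths.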

Thanks,
Stefan

On 2019-10-10 15:10, Haoyu Li wrote:
> Hi Stefan,
> 
> Thanks for your quick response! As to your concern about the OCA, I am 
> the sole author of the patch, and what the agreement states is indeed 
> the case.
> Best Regards,
> Haoyu Li,
> 
> 
> Stefan Johansson <stefan.johansson at oracle.com> wrote on Thu, Oct 10, 2019 at 8:37 PM:
> 
>     Hi,
> 
>     On 2019-10-10 13:06, Haoyu Li wrote:
>      > Hi Stefan,
>      >
>      > Thanks for your testing! One possible reason for the regressions in
>      > simple tests is that the region dependencies may not be heavy enough.
>      > Because the locality of shadow regions is lower than that of heap
>      > regions, writing to shadow regions will be slower than to normal
>      > regions, and this is part of the reason why I reuse shadow regions.
>      > Therefore, if only a few shadow regions are created and not reused,
>      > the overhead may not be amortized.
> 
>     I guess it is something like this. I thought that for "easy" heaps the
>     shadow regions won't be used at all, and should therefore not really
>     cost anything.
> 
>      >
>      > As to the OCA, it is the case that I'm the only person signing the
>      > agreement. Please let me know if you have any further questions.
>      > Thanks again!
> 
>     Ok, so you are the sole author of the patch. The important part, as the
>     agreement states, is:
>     "no other person or entity, including my employer, has or will have
>     rights with respect my contributions"
> 
>     Is that the case?
> 
>     Thanks,
>     Stefan
> 
>      >
>      > Best Regards,
>      > Haoyu Li
>      >
>      > Stefan Johansson <stefan.johansson at oracle.com> wrote on Tue, Oct 8, 2019 at 6:49 PM:
>      >
>      >     Hi Haoyu,
>      >
>      >     I've done some more testing and I haven't seen any issues with the
>      >     patch so far and the performance looks promising in most cases. For
>      >     simple tests I've seen some regressions, but I'm not really sure
>      >     why. Will do some more digging.
>      >
>      >     To move forward with this the first thing we need to do is to make
>      >     sure that you being covered by the Oracle Contributor Agreement is
>      >     enough. From what we can see it is only you as an individual that
>      >     has signed the OCA and in that case it is important that this
>      >     statement from the OCA is fulfilled: "no other person or entity,
>      >     including my employer, has or will have rights with respect my
>      >     contributions"
>      >
>      >     Is this the case for this contribution or should we have the
>      >     university sign the OCA as well? For more information regarding the
>      >     OCA please refer to:
>      >     https://www.oracle.com/technetwork/oca-faq-405384.pdf
>      >
>      >     Thanks,
>      >     Stefan
>      >
>      >     On 2019-09-16 16:02, Haoyu Li wrote:
>      >      > FYI, the evaluation results on OpenJDK 14 are plotted in the
>      >      > attachment. I compute the full GC throughput by dividing the heap
>      >      > size before full GC by the GC pause time, and the results are
>      >      > arithmetic mean values of ten runs after a warm-up run. The
>      >      > evaluation is conducted on a machine with dual Intel® Xeon™
>      >      > E5-2618L v3 CPUs (2 sockets, 16 physical cores with SMT enabled)
>      >      > and 64G DRAM.
>      >      >
>      >      > Best Regards,
>      >      > Haoyu Li,
>      >      > Institute of Parallel and Distributed Systems(IPADS),
>      >      > School of Software,
>      >      > Shanghai Jiao Tong University
>      >      >
>      >      >
>      >      > Stefan Johansson <stefan.johansson at oracle.com> wrote on Thu, Sep 12, 2019 at 5:34 AM:
>      >      >
>      >      >     Hi Haoyu,
>      >      >
>      >      >     I recently came across your patch and I would like to pick up on
>      >      >     some of the things Kim mentioned in his mails. I especially want
>      >      >     to evaluate and investigate if this is a technique we can use to
>      >      >     improve the other GCs as well. To start that work I want to take
>      >      >     the patch for a spin in our internal performance testing. The
>      >      >     patch doesn’t apply cleanly to the latest JDK repository, so if
>      >      >     you could provide an updated patch that would be very helpful.
>      >      >
>      >      >     It would also be great if you could share some more information
>      >      >     around the results presented in the paper. For example, it would
>      >      >     be good to get the full command lines for the different
>      >      >     benchmarks so we can run them locally and reproduce the results
>      >      >     you’ve seen.
>      >      >
>      >      >     Thanks,
>      >      >     Stefan
>      >      >
>      >      >>     On 12 March 2019 at 03:21, Haoyu Li <leihouyju at gmail.com> wrote:
>      >      >>
>      >      >>     Hi Kim,
>      >      >>
>      >      >>     Thanks for reviewing and testing the patch. If there are any
>      >      >>     failures or performance degradation relevant to the work, please
>      >      >>     let me know and I'll be very happy to keep improving it. Also,
>      >      >>     any suggestions about code improvements are much appreciated.
>      >      >>
>      >      >>     I'm not quite sure if both G1 and Shenandoah have a similar
>      >      >>     region dependency issue, since I haven't studied their GC
>      >      >>     behaviors before. If they do, I'm also willing to propose a
>      >      >>     more general optimization.
>      >      >>
>      >      >>     As to the memory overhead, I believe it will be low because this
>      >      >>     patch exploits empty regions in the young space rather than
>      >      >>     off-heap memory to allocate shadow regions, and also reuses the
>      >      >>     /_source_region/ field of each /RegionData/ to record the
>      >      >>     corresponding shadow region index. We only introduce a new
>      >      >>     integer field /_shadow/ in the RegionData class to indicate the
>      >      >>     status of a region, a global /GrowableArray _free_shadow/ to
>      >      >>     store the indices of shadow regions, and a global /Monitor/ to
>      >      >>     protect the array. This information might help if the memory
>      >      >>     overhead needs to be evaluated.
>      >      >>
>      >      >>     Looking forward to your insight.
>      >      >>
>      >      >>     Best Regards,
>      >      >>     Haoyu Li,
>      >      >>     Institute of Parallel and Distributed Systems(IPADS),
>      >      >>     School of Software,
>      >      >>     Shanghai Jiao Tong University
>      >      >>
>      >      >>
>      >      >>     Kim Barrett <kim.barrett at oracle.com> wrote on Tue, Mar 12, 2019 at 6:11 AM:
>      >      >>
>      >      >>         > On Mar 11, 2019, at 1:45 AM, Kim Barrett <kim.barrett at oracle.com> wrote:
>      >      >>         >
>      >      >>         >> On Jan 24, 2019, at 3:58 AM, Haoyu Li <leihouyju at gmail.com> wrote:
>      >      >>         >>
>      >      >>         >> Hi Kim,
>      >      >>         >>
>      >      >>         >> I have ported my patch to OpenJDK 13 according to your
>      >      >>         >> instructions in your last mail, and the patch is attached
>      >      >>         >> in this mail. The patch does not change much since PSGC is
>      >      >>         >> indeed pretty stable.
>      >      >>         >>
>      >      >>         >> Also, I evaluated the correctness and performance of PS
>      >      >>         >> full GC with benchmarks from the DaCapo, SPECjvm2008, and
>      >      >>         >> JOlden suites on a machine with dual Intel Xeon E5-2618L v3
>      >      >>         >> CPUs (16 physical cores), 64G DRAM and Linux kernel 4.17.
>      >      >>         >> The evaluation result, indicating 1.9X GC throughput
>      >      >>         >> improvement on average, is attached, too.
>      >      >>         >>
>      >      >>         >> However, I have no idea how to further test this patch for
>      >      >>         >> both correctness and performance. Can I please get any
>      >      >>         >> guidance from you or some sponsor?
>      >      >>         >
>      >      >>         > Sorry I missed that you had sent an updated version of the
>      >      >>         > patch.
>      >      >>         >
>      >      >>         > I’ve run the full regression suite across Oracle-supported
>      >      >>         > platforms.  There are some failures, but there are almost
>      >      >>         > always some failures in the later tiers right now.  I’ll
>      >      >>         > start looking at them tomorrow to figure out whether any of
>      >      >>         > them are relevant.
>      >      >>         >
>      >      >>         > I’m also planning to run some of our performance benchmarks.
>      >      >>         >
>      >      >>         > I’ve lightly skimmed the proposed changes.  There might be
>      >      >>         > some code improvements to be made.
>      >      >>         >
>      >      >>         > I’m also wondering if this technique applies to other
>      >      >>         > collectors.  It seems like both G1 and Shenandoah full gc’s
>      >      >>         > might have similar issues?  If so, a solution that is
>      >      >>         > ParallelGC-specific is less interesting than one that has
>      >      >>         > broader applicability.  Though maybe this optimization is
>      >      >>         > less important for G1 and Shenandoah, since they actively
>      >      >>         > try to avoid full gc’s.
>      >      >>         >
>      >      >>         > I’m also not clear on how much additional memory might be
>      >      >>         > temporarily allocated by this mechanism.
>      >      >>
>      >      >>         I’ve created a CR for this:
>      >      >> https://bugs.openjdk.java.net/browse/JDK-8220465
>      >      >>
>      >      >
>      >
>
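
For reference, the shadow-region bookkeeping described in the March 2019 
message quoted above (a list of free shadow region indices guarded by a 
lock) could look roughly like the following. This is a simplified sketch 
under stated assumptions, not the patch itself: ShadowRegionPool, 
try_acquire(), release() and kNoShadowRegion are hypothetical names, and 
std::vector and std::mutex stand in for HotSpot's GrowableArray and 
Monitor.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical sentinel meaning "no shadow region is currently free".
constexpr std::size_t kNoShadowRegion = static_cast<std::size_t>(-1);

// Hypothetical stand-in for the global free list of shadow regions.
class ShadowRegionPool {
 public:
  // The pool starts with the indices of the empty young-gen regions.
  explicit ShadowRegionPool(std::vector<std::size_t> empty_young_regions)
      : _free_shadow(std::move(empty_young_regions)) {}

  // Take a free shadow region index, or kNoShadowRegion if none is left.
  std::size_t try_acquire() {
    std::lock_guard<std::mutex> guard(_lock);
    if (_free_shadow.empty()) {
      return kNoShadowRegion;
    }
    std::size_t idx = _free_shadow.back();
    _free_shadow.pop_back();
    return idx;
  }

  // Hand a shadow region back once its contents have been copied to the
  // real destination, so that another GC thread can reuse it.
  void release(std::size_t idx) {
    assert(idx != kNoShadowRegion);
    std::lock_guard<std::mutex> guard(_lock);
    _free_shadow.push_back(idx);
  }

 private:
  std::mutex _lock;                       // protects _free_shadow
  std::vector<std::size_t> _free_shadow;  // indices of free shadow regions
};
```

The extra memory in this model is one index per free shadow region plus 
the lock, which matches the low overhead claimed in the quoted message.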