Understanding High Object Copy Times

Tue May 2 05:02:35 UTC 2017

Nezih


On 05/01/2017 10:42 AM, nezih yigitbasi wrote:
> Hi,
> We have THP enabled for memory regions that explicitly request huge 
> pages using madvise.  We don't have swapping enabled. We run Presto (a 
> distributed SQL engine, http://prestodb.io) on Java 1.8.0_112 and the 
> machine has ~250GB memory and 12 physical cores.
> I have some followup questions:
> - I was always thinking that remembered sets were updated in the 
> "Update RS" phase, can you please elaborate on the difference between 
> rset update vs. object copy phases in terms of the rset processing 
> that is being done in each?
When the application allocates objects, the remember set is updated 
concurrently with the application. What can not be updated during the 
concurrent stage, will be updated during the pause 'Update RS'. But 
after evacuating the objects, we also need to fix the pointers of the 
objects that are moved.
> - Is high system cpu time a good indicator for high contention?
High system cpu time indicates some work in the OS kernel level.
>
> Thanks!
> Nezih
>
> 2017-04-28 15:44 GMT-07:00 yu.zhang at oracle.com 
> <mailto:yu.zhang at oracle.com> <yu.zhang at oracle.com 
> <mailto:yu.zhang at oracle.com>>:
>
>     Hi, Nezih,
>
>     I agree with your estimation. There is not much promotion going on.
>
>     Though I do not see high system cpu, if you are on a Linux system,
>     can you verify that the OS Transparent Huge Page is off?
>
>     Remember set is the data structure that keep track of pointers
>     from other regions to the region we are operating on. If there are
>     pointers from old regions to the objects in the young regions, we
>     can not collect that object.
>     After a gc pause, the live objects are evacuated to either the
>     survivor or old gen regions. Then we need to update the Remember
>     set. That is what I was referring to.
>     It is part of object copy.
>
>     Like Kirk mentioned, the parallelism of the gc threads seems fine.
>     Charlie.Hunt suggested maybe the object graph is very deep so that
>     the gc threads can not steal work and end up spinning.
>
>     Some things we can try:
>     -XX:MaxGCPauseMillis=<200> Maybe increase it to 400
>     Maybe try to reduce the -XX:ParallelGCThreads
>
>     |-XX:+G1SummarizeRSetStats -XX:G1SummarizeRSetStatsPeriod=10|
>
>     It is very expensive to print those statistics, for now we know
>     there is a lot of coarsening. We can make
>     G1SumarizeRSetStatsPeriod=100 The RSet footprint is so big, 
>     increasing -XX:G1RSetRegionEntries will get rid of coarsening but
>     make the memory footprint bigger. We can delay this for now. What
>     is your hardware/software configuration? cpu/memory/cores? There
>     is no swapping, right? ThanksJenny
>     On 04/27/2017 08:52 PM, nezih yigitbasi wrote:
>>     My first question is incorrect actually, so I am giving a better
>>     example to rephrase my question. At time
>>     "2017-04-26T03:12:24.259-0700"  there is a young GC that
>>     took 35.47 s, where object copy took 28983.4 ms. In that event I
>>     see the following log:
>>        [Eden: 7168.0M(7168.0M)->0.0B(7168.0M) Survivors:
>>     1024.0M->1024.0M Heap: 153.4G(160.0G)->149.4G(160.0G)]
>>     My interpretation for this is, ~4GB of garbage was collected from
>>     heap in total and we see that eden usage goes down by ~7GB, this
>>     means ~3GB of the eden was live objects. Is this interpretation
>>     correct? If it is, how come copying over 3GB takes ~29s? In your
>>     answer you said "most of the object copy time is dealing with the
>>     Remember Set", can you please give some details about what kind
>>     of operations on rsets are done during the object copy phase, and
>>     can we see that from these logs?
>>     Thanks again,
>>     Nezih
>>     2017-04-27 17:20 GMT-07:00 nezih yigitbasi
>>     <nezihyigitbasi at gmail.com <mailto:nezihyigitbasi at gmail.com>>:
>>
>>         Thanks for the suggestions. We use the default pause time.
>>         And here is our entire set of JVM args:
>>         https://gist.github.com/nezihyigitbasi/04f5fdb9c32ac56097011819e20602d8
>>         <https://gist.github.com/nezihyigitbasi/04f5fdb9c32ac56097011819e20602d8>
>>
>>         I have some followup questions:
>>         - In some case the object copy took 39406.8 ms, even if the
>>         remembered set is ~30GB isn't this too slow (that's <1GB/s of
>>         data)?
>>         - Is there any way to reduce the rset overhead?
>>         - My initial thought when I saw the high object copy times
>>         was there may be some sort of contention to have such a low
>>         throughput during the copy. Although it may not be the case
>>         here, I just wonder whether there is a way to see the amount
>>         of contention from the gc logs?
>>         Nezih
>>         2017-04-27 16:58 GMT-07:00 Jenny Zhang <yu.zhang at oracle.com
>>         <mailto:yu.zhang at oracle.com>>:
>>
>>             Hi, Hezih, It seems this workload is very heavy on
>>             Remember Set. It has about 31G native memory for RSet
>>             (old gen) and still with coarsening. What is you pause
>>             time goal? The default (200ms) might be too small for
>>             you.  Can you increase that so G1 can increase the young
>>             gen size? Since there is not much promotion, I guess most
>>             of the object copy time is dealing with the Remember Set.
>>             There are other things you can try, like increase the
>>             G1RSetReginEntries, but the memory footprint will be
>>             bigger. So if you can, I suggest increase the pause time
>>             goal first. ThanksJenny
>>             On 4/27/2017 9:22 AM, nezih yigitbasi wrote:
>>
>>                 Hi, We see huge object copy times (and relatively
>>                 high termination times) during young GCs in our
>>                 production system running on Java 1.8.0_112-b15. You
>>                 can find the GC logs here:
>>                 https://gist.github.com/nezihyigitbasi/1f7a92da7860908a611cb1197bd8626b
>>                 <https://gist.github.com/nezihyigitbasi/1f7a92da7860908a611cb1197bd8626b>
>>                 The young GC times start going high after the
>>                 timestamp "2017-04-26T03:07:22.164-0700". I will
>>                 appreciate if you can give some details about: - what
>>                 goes into the "Object Copy" phase during young GCs
>>                 and how we can reduce it. - why we see high
>>                 Termination times and what we can do about it Thanks,
>>                 Nezih 
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/hotspot-gc-dev/attachments/20170501/b8536dfd/attachment.htm>