From simone.bordet at gmail.com  Thu Jul 16 08:16:34 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Thu, 16 Jul 2015 10:16:34 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
Message-ID:

Hi,

I have an application that reported very large (around 30 s) times to
process *zero* SoftReferences, for example:

6015.665: [SoftReference, 0 refs, 23.0169525 secs]
6038.682: [WeakReference, 1 refs, 0.0046033 secs]
6038.687: [FinalReference, 31647 refs, 0.0090301 secs]
6038.696: [PhantomReference, 241 refs, 0.0048419 secs]
6038.701: [JNI Weak Reference, 0.0000463 secs], 23.2166772 secs]

We have been hit by this anomaly a few times now, and in the attached
logs (which also show the command-line flags) it happened 3 times: at
uptimes 6015.512, 6074.487 and 6141.161.

What happens after these long pauses is that G1 goes into "GC overhead
mode": it tries to expand the heap (which fails, because the heap is
already fully expanded) but keeps the Eden at a very small size. The
result is a series of back-to-back collections lasting almost 3
minutes, during which the MMU dropped and the application became
almost unusable. After that, G1 recovered to normal behavior.

I was wondering whether anyone knows more about this issue (long times
to process zero soft references), or whether it has been fixed in more
recent releases.

We are not aware of any other process that could have caused this,
such as busy disk I/O or swapping (the machine has plenty of memory
left and is dedicated to the JVM), but we will run jHiccup next time.
However, the fact that it always happened during the processing of
soft references seems suspicious.

Should I file an issue?

Thanks!

-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: akiba-20150713-gc.log.gz
Type: application/x-gzip
Size: 594940 bytes

From charlie.hunt at oracle.com  Mon Jul 20 20:24:17 2015
From: charlie.hunt at oracle.com (charlie hunt)
Date: Mon, 20 Jul 2015 15:24:17 -0500
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References:
Message-ID: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com>

Hi Simone,

It seems very peculiar to see 0 SoftReferences processed yet such an
incredibly high reported time. A couple of questions popped into my
mind as I looked through the logs.

I'm assuming this is on Linux? If so, could you confirm that THP
(transparent huge pages) is disabled?

And did you happen to try tuning -XX:SoftRefLRUPolicyMSPerMB down from
its default of 1000 to something as low as 1, to see whether what
you're seeing goes away?

Perhaps someone on the GC team has some thoughts on a situation where
we would see 0 SoftReferences processed yet a high amount of time
spent there.
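For clarity, the check and the experiment I have in mind look roughly
like this (the value 1 is just the illustrative lower bound mentioned
above, not a recommendation):

$ cat /sys/kernel/mm/transparent_hugepage/enabled   # "never" should be the selected value
$ java -XX:SoftRefLRUPolicyMSPerMB=1 ...            # default is 1000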
thanks,

charlie

> On Jul 16, 2015, at 3:16 AM, Simone Bordet wrote:
>
> Hi,
>
> I have an application that reported very large (around 30 s) times to
> process *zero* SoftReferences, for example:
>
> 6015.665: [SoftReference, 0 refs, 23.0169525 secs]
> 6038.682: [WeakReference, 1 refs, 0.0046033 secs]
> 6038.687: [FinalReference, 31647 refs, 0.0090301 secs]
> 6038.696: [PhantomReference, 241 refs, 0.0048419 secs]
> 6038.701: [JNI Weak Reference, 0.0000463 secs], 23.2166772 secs]
>
> We have been hit by this anomaly a few times now, and in the attached
> logs (which also show the command-line flags) it happened 3 times: at
> uptimes 6015.512, 6074.487 and 6141.161.
>
> [snip]
>
> Should I file an issue?
>
> Thanks!

From yanping.wang at intel.com  Mon Jul 20 20:49:19 2015
From: yanping.wang at intel.com (Wang, Yanping)
Date: Mon, 20 Jul 2015 20:49:19 +0000
Subject: hotspot-gc-use Digest, Vol 88, Issue 1
In-Reply-To:
References:
Message-ID: <222E9E27A7469F4FA2D137F0724FBD37A41E928B@ORSMSX105.amr.corp.intel.com>

Hi, Simone

Is the application you mentioned related to HDFS FIS and FOS?

From the log, there are 31647 FinalReferences. I think the first pause
figure, reported against SoftReference, includes overhead from the rest
of reference processing, so maybe those FinalReferences are the
problem?

One suggestion: you can use

    jmap -dump:format=b,file=/home/test.hprof

to collect a heap profile, then use Eclipse MAT (Memory Analyzer,
http://www.eclipse.org/mat/downloads.php) to open the .hprof file and
use Open Query Browser -> Java Basics -> References to see where those
references come from.
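For example (the pid 12345 is a placeholder; jmap needs the process id
of the target JVM, and the dump path can be anywhere with enough disk
space):

$ jps -l                                            # find the pid of the target JVM
$ jmap -dump:format=b,file=/home/test.hprof 12345   # then open test.hprof in MAT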
Thanks
-yanping

-----Original Message-----
From: hotspot-gc-use [mailto:hotspot-gc-use-bounces at openjdk.java.net] On Behalf Of hotspot-gc-use-request at openjdk.java.net
Sent: Monday, July 20, 2015 12:49 PM
To: hotspot-gc-use at openjdk.java.net
Subject: hotspot-gc-use Digest, Vol 88, Issue 1

Today's Topics:

   1. Re: Long Reference Processing Time (Tao Mao)
   2. G1: SoftReference, 0 refs, 31.0027203 secs (Simone Bordet)

----------------------------------------------------------------------

Message: 1
Date: Tue, 23 Jun 2015 17:42:19 -0700
From: Tao Mao
To: Simone Bordet
Cc: "hotspot-gc-use at openjdk.java.net"
Subject: Re: Long Reference Processing Time

Or, give Java Mission Control a try!

-Tao

On Wed, May 20, 2015 at 3:52 PM, Simone Bordet wrote:
> Hi,
>
> On Thu, May 21, 2015 at 12:41 AM, Joy Xiong wrote:
> > Are there other ways to do this? It's a prod environment and it
> > would be too intrusive to take a heap dump...
>
> We used our solution in production too, by enabling it for a few
> minutes to collect data (via JMX) and then disabling it until the
> next restart (also via JMX), where it was removed.
> It required 2 restarts: one to add the instrumentation, and one to
> remove it.
>
> -- 
> Simone Bordet
> http://bordet.blogspot.com

------------------------------
From kim.barrett at oracle.com  Tue Jul 21 01:52:00 2015
From: kim.barrett at oracle.com (Kim Barrett)
Date: Mon, 20 Jul 2015 21:52:00 -0400
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com>
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com>
Message-ID: <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>

On Jul 20, 2015, at 4:24 PM, charlie hunt wrote:
>
> Hi Simone,
>
> It seems very peculiar to see 0 SoftReferences processed yet such an
> incredibly high reported time.
>
> [snip]
>
> Perhaps someone on the GC team has some thoughts on a situation where
> we would see 0 SoftReferences processed yet a high amount of time
> spent there.

We've seen other reports of long soft-reference processing times
despite having none to process. See, for example, email to this list
from Joy Xiong, circa 5/20/2015, subject "Long Reference Processing
Time".

I've spent some time looking at the reference processing code, but I
don't see anything in the reference processing code itself that would
cause this.

However, there might be a possible mis-attribution of time here. Soft
references are the first references to be processed. The phase 1
reference processing first iterates over the (empty, in this case)
soft reference list. It then calls the "complete_gc" closure to
process any mark-stack entries added by that iteration. But if there
were already mark-stack entries when reference processing was started,
the time for processing them would be included in the soft reference
processing time.

I *think* the mark stack (including the thread queues) ought to be
empty when reference processing is started, but I'm not certain of
that. If it isn't empty but is supposed to be, that would be a bug. If
it isn't empty and that's permitted, then the resulting
mis-attribution of time is a bug.
And if it is empty but we still get unexpectedly long soft-reference
processing times, then this hypothesis is falsified.

I don't yet see a way to tell whether the mark stack is empty, or to
correct the time attribution, that doesn't involve patching the source
code and rebuilding.

From jon.masamitsu at oracle.com  Wed Jul 22 17:36:06 2015
From: jon.masamitsu at oracle.com (Jon Masamitsu)
Date: Wed, 22 Jul 2015 10:36:06 -0700
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References:
Message-ID: <55AFD486.9090502@oracle.com>

On 7/16/2015 1:16 AM, Simone Bordet wrote:
> Hi,
>
> I have an application that reported very large (around 30 s) times to
> process *zero* SoftReferences, for example:
>
> 6015.665: [SoftReference, 0 refs, 23.0169525 secs]
> 6038.682: [WeakReference, 1 refs, 0.0046033 secs]
> 6038.687: [FinalReference, 31647 refs, 0.0090301 secs]
> 6038.696: [PhantomReference, 241 refs, 0.0048419 secs]
> 6038.701: [JNI Weak Reference, 0.0000463 secs], 23.2166772 secs]
>
> We have been hit by this anomaly a few times now, and in the attached
> logs (which also show the command-line flags) it happened 3 times: at
> uptimes 6015.512, 6074.487 and 6141.161.

I noted that the entries for these times all have short user times and
long real times:

  [Times: user=3.63 sys=0.00, real=23.22 secs]
  [Times: user=1.54 sys=0.08, real=31.41 secs]
  [Times: user=2.68 sys=0.79, real=29.95 secs]

I see that you don't expect other processes to be interfering (no busy
disk I/O or swapping, as you say below), but maybe something else is
going on in the system that is showing up as the SoftRef processing
time. I don't know why it would always show up as SoftRef time, though
Kim suggests that it might be incorrect attribution. Does this happen
on multiple systems?

Jon
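P.S. Pauses like these can be picked out of the log mechanically.
Something along these lines (the log file name is a placeholder)
prints every entry whose real time is more than five times its user
time:

$ awk '/\[Times:/ { u = $0; sub(/.*user=/, "", u); sub(/ .*/, "", u);
                    r = $0; sub(/.*real=/, "", r); sub(/ .*/, "", r);
                    if (r + 0 > 5 * (u + 0) + 1) print }' gc.log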
I found it helpful to play around the following JVM options as well as Java monitoring tools such as Jconsole/JMC: -Xmx -Xms -XX:MaxGCPauseMillis= -XX:InitiatingHeapOccupancyPercent=<%> -XX:+PrintReferenceGC -XX:+ParallelRefProcEnabled -XX:+PrintAdaptiveSizePolicy -XX:ParallelGCThreads=n -XX:ConcGCThreads=n -XX:G1MixedGCCountTarget=n As for region sizing, since you have a large heap, you may want to check the relationship of region sizes and humongous object allocation, making sure they are harmonious. If you suspect any problem in that end, you can try G1PrintHeapRegions to diagnose. Hope this helps. changed cc to hotspot-gc-use Thanks. Tao Mao On Wed, Jul 22, 2015 at 6:51 AM, Kees Jan Koster wrote: > Dear All, > > Marcus Lagergren suggested I post these questions on this list. We are > considering switching to using the G1 GC for a decently sized HBase > cluster, and ran into some questions. Hope you can help me our, or point me > to the place where I should ask. > > First -> sizing: Our machines have 128GB of RAM. We run on the bare metal. > Is there a practical limit to the heap size we should allocate to the JVM > when using the G1 GC? What kind of region sizing should we use, or should > we just let G1 do what it does? > > Second -> failure modes. Does G1 have any failure or fall-back modes that > it will use for edge cases? How do we monitor for those? > > Finally: Are there any gotcha?s to keep in mind, or any tunables that we > have to invest time into when we want to run smoothly with 100GB+ heap > sizes? > > > -- > Kees Jan > > http://java-monitor.com/ > kjkoster at kjkoster.org > +31651838192 > > Change is good. Granted, it is good in retrospect, but change is good. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kjkoster at gmail.com Thu Jul 23 13:56:26 2015 From: kjkoster at gmail.com (Kees Jan Koster) Date: Thu, 23 Jul 2015 16:56:26 +0300 Subject: G1 GC for 100GB+ heaps In-Reply-To: References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> Message-ID: <94E5CBE6-CC26-4541-A2E9-25F997FF8DC8@gmail.com> Dear Tao, Thank you for the response. We will enable GC logging and see what we can learn from it. > I had experience in tuning G1 with ~30GB in production environment (not as large as you attempt :). I found it helpful to play around the following JVM options as well as Java monitoring tools such as Jconsole/JMC: > > -Xmx -Xms > -XX:MaxGCPauseMillis= > -XX:InitiatingHeapOccupancyPercent=<%> > -XX:+PrintReferenceGC > -XX:+ParallelRefProcEnabled > -XX:+PrintAdaptiveSizePolicy > -XX:ParallelGCThreads=n > -XX:ConcGCThreads=n > -XX:G1MixedGCCountTarget=n > > As for region sizing, since you have a large heap, you may want to check the relationship of region sizes and humongous object allocation, making sure they are harmonious. If you suspect any problem in that end, you can try G1PrintHeapRegions to diagnose. -- Kees Jan http://java-monitor.com/ kjkoster at kjkoster.org +31651838192 I hate unit tests; I much prefer the illusion that there are no errors in my code. -- Hendrik Muller -------------- next part -------------- A non-text attachment was scrubbed... 
As for region sizing: since you have a large heap, you may want to
check the relationship between region size and humongous-object
allocation, making sure the two are harmonious. If you suspect a
problem at that end, you can try G1PrintHeapRegions to diagnose.

Hope this helps. (Changed cc to hotspot-gc-use.)

Thanks.
Tao Mao

On Wed, Jul 22, 2015 at 6:51 AM, Kees Jan Koster wrote:
> Dear All,
>
> Marcus Lagergren suggested I post these questions on this list. We
> are considering switching to the G1 GC for a decently sized HBase
> cluster, and ran into some questions. Hope you can help me out, or
> point me to the place where I should ask.
>
> First -> sizing: our machines have 128GB of RAM, and we run on the
> bare metal. Is there a practical limit to the heap size we should
> allocate to the JVM when using the G1 GC? What kind of region sizing
> should we use, or should we just let G1 do what it does?
>
> Second -> failure modes: does G1 have any failure or fall-back modes
> that it will use for edge cases? How do we monitor for those?
>
> Finally: are there any gotchas to keep in mind, or any tunables we
> have to invest time into when we want to run smoothly with 100GB+
> heap sizes?
>
> -- 
> Kees Jan
>
> http://java-monitor.com/
> kjkoster at kjkoster.org
> +31651838192

From kjkoster at gmail.com  Thu Jul 23 13:56:26 2015
From: kjkoster at gmail.com (Kees Jan Koster)
Date: Thu, 23 Jul 2015 16:56:26 +0300
Subject: G1 GC for 100GB+ heaps
In-Reply-To:
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com>
Message-ID: <94E5CBE6-CC26-4541-A2E9-25F997FF8DC8@gmail.com>

Dear Tao,

Thank you for the response. We will enable GC logging and see what we
can learn from it.

> I have experience tuning G1 with a ~30GB heap in a production
> environment (not as large as you are attempting :). I found it
> helpful to play around with the following JVM options, as well as
> with Java monitoring tools such as JConsole/JMC:
>
> [snip]
>
> As for region sizing: since you have a large heap, you may want to
> check the relationship between region size and humongous-object
> allocation, making sure the two are harmonious. If you suspect a
> problem at that end, you can try G1PrintHeapRegions to diagnose.

-- 
Kees Jan

http://java-monitor.com/
kjkoster at kjkoster.org
+31651838192

I hate unit tests; I much prefer the illusion that there are no errors
in my code. -- Hendrik Muller

From kjkoster at gmail.com  Thu Jul 23 14:41:41 2015
From: kjkoster at gmail.com (Kees Jan Koster)
Date: Thu, 23 Jul 2015 17:41:41 +0300
Subject: G1 GC for 100GB+ heaps
In-Reply-To: <1437637807.2347.37.camel@oracle.com>
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> <1437637807.2347.37.camel@oracle.com>
Message-ID: <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com>

Dear Thomas,

Thank you for the helpful response and for the links.

>> Marcus Lagergren suggested I post these questions on this list. We
>> are considering switching to the G1 GC for a decently sized HBase
>> cluster, and ran into some questions. Hope you can help me out, or
>> point me to the place where I should ask.
>
> This place is fine, although hotspot-gc-use might be more
> appropriate.

Moved CC there.

> However, you do not mention what your goals are (throughput or
> latency or a mix of the two), so it is hard to say whether G1 can
> meet your expectations.

Our goals are to limit pause times. Most traffic on the HBase cluster
comes from background jobs such as generating indexes and searching,
but occasionally we retrieve a document synchronously from the web
front-end, which we want to serve quickly. The max pause time we aim
for is 100ms, which looks to be entirely doable. Maybe we should set
our goals a little more aggressively. ;-)

We have a test cluster running with -XX:MaxGCPauseMillis=100, but I
found that this actually results in an *average* of 100ms, not a max.
Is that observation correct? What am I misinterpreting?

>> What kind of region sizing should we use, or should we just let G1
>> do what it does?
>
> Initially we recommend just setting heap size (Xms/Xmx) and a pause
> time goal (-XX:MaxGCPauseMillis). Depending on your results, try
> decreasing G1NewSizePercent and increasing the number of marking
> threads (see the first few links above).

Right, that's what I heard from a few sources: set the size and the
pause time target, and just leave it alone.

> Consider that G1 needs some extra space for operation. So at a 100G
> Java heap and 128G RAM, the system might start to swap/thrash,
> particularly if other stuff is running there. I.e. monitor that using
> e.g. vmstat. Should be avoided :)

Yes, these are dedicated machines and should never hit swap. We'll
keep an eye out to avoid the system hitting swap. Today the machines
are running with 64GB heaps for that reason.

> If you are running on Linux, completely disable Transparent Huge
> Pages (use a search engine to find out how that is done on your
> particular distro). Always; we have found no exceptions.

Thank you for that advice. I got the same advice from Kirk Pepperdine
this week. Our systems actually run with transparent huge pages
enabled, and I'll ask the guys to switch that off:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
$ _
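For anyone reading along, the immediate switch-off appears to be a
one-liner (run as root); making it survive a reboot is distro-specific,
typically a kernel boot parameter or an init script:

$ echo never > /sys/kernel/mm/transparent_hugepage/enabled
$ echo never > /sys/kernel/mm/transparent_hugepage/defrag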
> Other than that, the above recommendations should be okay. If there
> are particular issues, you may want to come back with a log of a
> problematic run with at least -XX:+PrintGCTimeStamps and
> -XX:+PrintGCDetails set.

Thank you for the kind offer. It will be a few weeks before we get
into the thick of this, as the summer holidays are settling over us.

-- 
Kees Jan

http://java-monitor.com/
kjkoster at kjkoster.org
+31651838192

The secret of success lies in the stability of the goal.
-- Benjamin Disraeli

From ecki at zusammenkunft.net  Thu Jul 23 18:58:36 2015
From: ecki at zusammenkunft.net (Bernd Eckenfels)
Date: Thu, 23 Jul 2015 20:58:36 +0200
Subject: G1 GC for 100GB+ heaps
In-Reply-To: <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com>
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> <1437637807.2347.37.camel@oracle.com> <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com>
Message-ID: <20150723205836.0000172c.ecki@zusammenkunft.net>

On Thu, 23 Jul 2015 17:41:41 +0300, Kees Jan Koster wrote:

> The max pause time we aim for is 100ms, which looks to be entirely
> doable. Maybe we should set our goals a little more aggressively. ;-)

100ms sounds very aggressive for such large heaps; I would not expect
to reach it, and it is actually better not to aim for it, either.

> We have a test cluster running with -XX:MaxGCPauseMillis=100, but I
> found that this actually results in an *average* of 100ms, not a max.
> Is that observation correct? What am I misinterpreting?

It is a hint, and not a very reliable one at that: neither a max nor
an average. But if you undershoot, it will make things worse. Maybe
try 200ms and see whether your average stays the same while the peaks
become flatter (not untypical).

Anyway, without seeing your logs and knowing your hardware specs, it
is hard to give more hints.

>> Consider that G1 needs some extra space for operation. So at a 100G
>> Java heap and 128G RAM, the system might start to swap/thrash,
>> particularly if other stuff is running there. I.e. monitor that
>> using e.g. vmstat. Should be avoided :)
>
> Yes, these are dedicated machines and should never hit swap. We'll
> keep an eye out to avoid the system hitting swap. Today the machines
> are running with 64GB heaps for that reason.

If you have 128GB and aim for a 100GB heap, keeping 10GB for the rest
of the JVM process, then you still have a good 10GB for the filesystem
cache. This should be reflected in the swappiness setting on Linux
(5-10 for large machines with application-server load).

>> If you are running on Linux, completely disable Transparent Huge
>> Pages (use a search engine to find out how that is done on your
>> particular distro). Always; we have found no exceptions.
>
> Our systems actually run with transparent huge pages enabled, and
> I'll ask the guys to switch that off.

You might, however, reserve and turn on real large pages for the JVM
with -XX:+UseLargePages. When using such a large heap, it can save a
lot of resources.
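For example (run as root; the page count is only an illustration: with
the usual 2MB huge pages on x86_64, a 100GB heap needs about 51200 of
them, reserved before the JVM starts):

$ sysctl -w vm.swappiness=10
$ sysctl -w vm.nr_hugepages=51200
$ java -XX:+UseLargePages -Xms100g -Xmx100g ...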
Regards,
Bernd

From kjkoster at gmail.com  Fri Jul 24 06:26:29 2015
From: kjkoster at gmail.com (Kees Jan Koster)
Date: Fri, 24 Jul 2015 09:26:29 +0300
Subject: G1 GC for 100GB+ heaps
In-Reply-To: <20150723205836.0000172c.ecki@zusammenkunft.net>
References: <20150722135159.4CBEC1A981D@saffron.java-monitor.com> <1437637807.2347.37.camel@oracle.com> <639A5829-CC40-4C52-9EE6-770F124CE1E4@gmail.com> <20150723205836.0000172c.ecki@zusammenkunft.net>
Message-ID: <94FBEEE0-D0EA-43A0-91A5-B5BF1F1A1581@gmail.com>

Dear Bernd,

>> The max pause time we aim for is 100ms, which looks to be entirely
>> doable. Maybe we should set our goals a little more aggressively.
>> ;-)
>
> 100ms sounds very aggressive for such large heaps; I would not expect
> to reach it, and it is actually better not to aim for it, either.

What would you find reasonable for a 100GB heap? And for a 200GB heap?
I am not so much interested in the exact value you name as in the
ball-park you are in. We'll be sure to test for what works best for
us. I'd rather not set it to a value at which the GC cannot work
reliably, but we would like it to be quick, to support the front-end
document fetch use case.

What happens when we set this value too low? How does that restrict GC
behaviour, and what happens then?

> Anyway, without seeing your logs and knowing your hardware specs, it
> is hard to give more hints.

I understand, but you have helped tremendously already. The next round
of questions should come with some logs at hand.

> If you have 128GB and aim for a 100GB heap, keeping 10GB for the rest
> of the JVM process, then you still have a good 10GB for the
> filesystem cache. This should be reflected in the swappiness setting
> on Linux (5-10 for large machines with application-server load).

We run with swappiness 10. I don't expect much from the filesystem
cache: the access pattern is pretty random, and we already cache files
that are read more than once on the batch-processing servers.

> You might, however, reserve and turn on real large pages for the JVM
> with -XX:+UseLargePages. When using such a large heap, it can save a
> lot of resources.

Will do, thank you.

-- 
Kees Jan

http://java-monitor.com/
kjkoster at kjkoster.org
+31651838192

Human beings make life so interesting. Do you know that in a universe
so full of wonders, they have managed to invent boredom. Quite
astonishing... -- Terry Pratchett

From simone.bordet at gmail.com  Mon Jul 27 13:37:38 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Mon, 27 Jul 2015 15:37:38 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To: <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com> <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
Message-ID:

Kim,

On Tue, Jul 21, 2015 at 3:52 AM, Kim Barrett wrote:
> On Jul 20, 2015, at 4:24 PM, charlie hunt wrote:
>> Hi Simone,
>>
>> It seems very peculiar to see 0 SoftReferences processed yet such an
>> incredibly high reported time.
>>
>> [snip]
>> Perhaps someone on the GC team has some thoughts on a situation
>> where we would see 0 SoftReferences processed yet a high amount of
>> time spent there.
>
> We've seen other reports of long soft-reference processing times
> despite having none to process. See, for example, email to this list
> from Joy Xiong, circa 5/20/2015, subject "Long Reference Processing
> Time".
>
> I've spent some time looking at the reference processing code, but I
> don't see anything in the reference processing code itself that would
> cause this.
>
> However, there might be a possible mis-attribution of time here. Soft
> references are the first references to be processed. The phase 1
> reference processing first iterates over the (empty, in this case)
> soft reference list. It then calls the "complete_gc" closure to
> process any mark-stack entries added by that iteration. But if there
> were already mark-stack entries when reference processing was
> started, the time for processing them would be included in the soft
> reference processing time.
>
> I *think* the mark stack (including the thread queues) ought to be
> empty when reference processing is started, but I'm not certain of
> that. If it isn't empty but is supposed to be, that would be a bug.
> If it isn't empty and that's permitted, then the resulting
> mis-attribution of time is a bug. And if it is empty but we still get
> unexpectedly long soft-reference processing times, then this
> hypothesis is falsified.
>
> I don't yet see a way to tell whether the mark stack is empty, or to
> correct the time attribution, that doesn't involve patching the
> source code and rebuilding.

Thanks for looking into this.

We are open to patching, rebuilding and testing if we get guidance on
the repo to pull, or other instructions to build the modified version.
Let us know if you have either modified code or indications for us to
modify the code.

Thanks!

-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

From simone.bordet at gmail.com  Mon Jul 27 13:55:29 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Mon, 27 Jul 2015 15:55:29 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To: <55AFD486.9090502@oracle.com>
References: <55AFD486.9090502@oracle.com>
Message-ID:

Hi,

On Wed, Jul 22, 2015 at 7:36 PM, Jon Masamitsu wrote:
> I noted that the entries for these times all have short user times
> and long real times:
>
>   [Times: user=3.63 sys=0.00, real=23.22 secs]
>   [Times: user=1.54 sys=0.08, real=31.41 secs]
>   [Times: user=2.68 sys=0.79, real=29.95 secs]
>
> I see that you don't expect other processes to be interfering (no
> busy disk I/O or swapping, as you say below), but maybe something
> else is going on in the system that is showing up as the SoftRef
> processing time. I don't know why it would always show up as SoftRef
> time, though Kim suggests that it might be incorrect attribution.
> Does this happen on multiple systems?

There is only one system for now.

We are also looking more closely at disk I/O and THP. If something
pops up, I'll keep this list informed.
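Concretely, the plan is to watch the machine around the pauses with
plain OS tools, something like (the 5-second interval is arbitrary):

$ vmstat 5                                          # si/so columns show swap activity
$ iostat -x 5                                       # per-device disk utilization
$ cat /sys/kernel/mm/transparent_hugepage/enabled   # confirm THP state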
Thanks!

-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

From kim.barrett at oracle.com  Tue Jul 28 21:34:34 2015
From: kim.barrett at oracle.com (Kim Barrett)
Date: Tue, 28 Jul 2015 17:34:34 -0400
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com> <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
Message-ID:

On Jul 27, 2015, at 9:37 AM, Simone Bordet wrote:
>
> Thanks for looking into this.
>
> We are open to patching, rebuilding and testing if we get guidance on
> the repo to pull, or other instructions to build the modified
> version. Let us know if you have either modified code or indications
> for us to modify the code.

I'm going to try to come up with a patch that you could try. You
probably mentioned somewhere what Java version you are using, but I
couldn't find it. Having that would give me the starting point for
making a patch.

From simone.bordet at gmail.com  Tue Jul 28 21:38:24 2015
From: simone.bordet at gmail.com (Simone Bordet)
Date: Tue, 28 Jul 2015 23:38:24 +0200
Subject: G1: SoftReference, 0 refs, 31.0027203 secs
In-Reply-To:
References: <0ABDB32D-AF14-4F37-8E7C-86C090DDEDC2@oracle.com> <7C3F2F89-0017-43AE-9022-B9AB1CD03FD6@oracle.com>
Message-ID:

Hi Kim,

On Tue, Jul 28, 2015 at 11:34 PM, Kim Barrett wrote:
> I'm going to try to come up with a patch that you could try. You
> probably mentioned somewhere what Java version you are using, but I
> couldn't find it. Having that would give me the starting point for
> making a patch.

It's in the logs: 1.8.0_25-b17. But we're open to trying 8u51 or even
8u60, whatever makes it easier for you.

Thanks!
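P.S. In case it helps, our build plan on this end is the usual jdk8u
source build, applying your patch to the hotspot tree before building
(commands from memory, so treat this as a sketch rather than a
recipe):

$ hg clone http://hg.openjdk.java.net/jdk8u/jdk8u
$ cd jdk8u
$ sh ./get_source.sh
$ bash ./configure
$ make images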
-- 
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless.   Victoria Livschitz

From srini_was at yahoo.com  Tue Jul 28 20:36:44 2015
From: srini_was at yahoo.com (Srini Padman)
Date: Tue, 28 Jul 2015 20:36:44 +0000 (UTC)
Subject: Object Copy and Termination times leading to long pauses
Message-ID: <1916430084.4436392.1438115804899.JavaMail.yahoo@mail.yahoo.com>

Hello,

We are seeing occasional long young-GC pauses, on the order of 20-25
seconds, in our application. When this happens, the pause generally
occurs a few hours after the JVM starts. I am extracting the G1 GC
settings and GC logs below. The bulk of the time in these pauses seems
to be spent in the "Object Copy" and "Termination" phases, and I am
not sure what to do about those. Any help you can offer will be
greatly appreciated!

JVM settings:
--------------
-server -Xms4096m -Xmx4096m -Xss512k
-XX:PermSize=128m -XX:MaxPermSize=128m
-XX:+UseG1GC -XX:G1HeapRegionSize=2m
-XX:G1MixedGCLiveThresholdPercent=75
-XX:G1HeapWastePercent=5
-XX:InitiatingHeapOccupancyPercent=65
-XX:+ParallelRefProcEnabled
-XX:+DisableExplicitGC
-XX:+UnlockDiagnosticVMOptions
-XX:+UnlockExperimentalVMOptions
-XX:+ForceTimeHighResolution

For the sake of completeness, I should add that we also have the
following (logging) options:

-verbose:gc
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:-PrintTenuringDistribution
-XX:+PrintPromotionFailure
-XX:+G1PrintRegionLivenessInfo

GC log snippet (full file attached):
-------------------------------------
2015-07-22T21:02:54.488-0700: 47912.944: [GC pause (young)
 47913.463: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 50724, predicted base time: 64.43 ms, remaining time: 135.57 ms, target pause time: 200.00 ms]
 47913.463: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1178 regions, survivors: 50 regions, predicted young region time: 28.63 ms]
 47913.463: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1178 regions, survivors: 50 regions, old: 0 regions, predicted pause time: 93.06 ms, target pause time: 200.00 ms]
, 24.2055085 secs]
   [Parallel Time: 22056.7 ms, GC Workers: 8]
      [GC Worker Start (ms): Min: 47913627.1, Avg: 47913687.6, Max: 47913745.7, Diff: 118.6]
      [Ext Root Scanning (ms): Min: 652.7, Avg: 804.7, Max: 913.9, Diff: 261.2, Sum: 6437.8]
      [Update RS (ms): Min: 1396.7, Avg: 1446.8, Max: 1496.6, Diff: 99.9, Sum: 11574.4]
         [Processed Buffers: Min: 13, Avg: 37.3, Max: 60, Diff: 47, Sum: 298]
      [Scan RS (ms): Min: 0.1, Avg: 22.2, Max: 44.3, Diff: 44.2, Sum: 177.3]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.7, Max: 3.9, Diff: 3.9, Sum: 5.5]
      [Object Copy (ms): Min: 5649.2, Avg: 6518.0, Max: 6904.2, Diff: 1255.0, Sum: 52144.2]
      [Termination (ms): Min: 12811.8, Avg: 13175.9, Max: 14032.8, Diff: 1221.0, Sum: 105406.8]
      [GC Worker Other (ms): Min: 0.1, Avg: 5.7, Max: 23.4, Diff: 23.3, Sum: 45.3]
      [GC Worker Total (ms): Min: 21915.9, Avg: 21973.9, Max: 22034.4, Diff: 118.5, Sum: 175791.3]
      [GC Worker End (ms): Min: 47935661.5, Avg: 47935661.6, Max: 47935661.6, Diff: 0.2]
   [Code Root Fixup: 0.2 ms]
   [Code Root Migration: 0.5 ms]
   [Clear CT: 227.8 ms]
   [Other: 1920.3 ms]
      [Choose CSet: 0.1 ms]
      [Ref Proc: 646.0 ms]
      [Ref Enq: 2.6 ms]
      [Free CSet: 539.4 ms]
   [Eden: 2356.0M(2356.0M)->0.0B(100.0M) Survivors: 100.0M->104.0M Heap: 3495.9M(4096.0M)->1150.8M(4096.0M)]
 [Times: user=103.32 sys=1.59, real=24.26 secs]

Regards,
Srini.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: gc-20150722_074417.zip
Type: application/zip
Size: 86418 bytes

From gdesmet at redhat.com  Fri Jul 31 11:51:03 2015
From: gdesmet at redhat.com (Geoffrey De Smet)
Date: Fri, 31 Jul 2015 13:51:03 +0200
Subject: Make G1 the Default GC - not a good idea for heavy calculation use cases
Message-ID: <55BB6127.4070309@redhat.com>

Hi guys,

I have run some benchmarks on OptaPlanner use cases with the latest
OpenJDK 8 to assess the impact of switching the default GC to G1:

http://www.optaplanner.org/blog/2015/07/31/WhatIsTheFastestGarbageCollectorInJava8.html

Short summary: G1 is consistently worse in every use case, for every
dataset...

-- 
With kind regards,
Geoffrey De Smet