From fancyerii at gmail.com  Thu Mar  1 03:13:19 2012
From: fancyerii at gmail.com (Li Li)
Date: Thu, 1 Mar 2012 19:13:19 +0800
Subject: the memory usage of hotspot vm itself
Message-ID: <CAFAd71UQqLnr5jfVyb_+EZ7Zmi4LXn0Qqg-zxWx4aUNu=bLZyA@mail.gmail.com>

    I have an application running on a machine with 6 GB of memory. The
java program is given a 4000 MB heap via -Xmx4000m and -Xms4000m, but we
found that it uses swap and the system gets slower.
When swapping, the total memory used by the java process is 4.6 GB (top
shows RES 4.6G and VIRT 5.1G).
   I know that direct buffers used by NIO, PermGen and thread stacks are
not limited by -Xmx. We limit the stack to 256k with about 400 threads at
peak, so the stacks will use about 80 MB; MaxPermSize is 256 MB; and
MaxDirectMemorySize is left at its default, which I think is 64 MB.
All summed up (4000 + 256 + 80 + 64) that is about 4400 MB, so the JVM
itself must be using more than 200 MB on top of that.
   Can anyone tell me why the JVM needs more than 200 MB of memory? (I
don't mean that the JVM should not use that much memory.) How can I
estimate the memory usage of the JVM itself? Thanks.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120301/593ef591/attachment.html 

From ysr1729 at gmail.com  Thu Mar  1 09:44:12 2012
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Thu, 1 Mar 2012 09:44:12 -0800
Subject: the memory usage of hotspot vm itself
In-Reply-To: <CAFAd71UQqLnr5jfVyb_+EZ7Zmi4LXn0Qqg-zxWx4aUNu=bLZyA@mail.gmail.com>
References: <CAFAd71UQqLnr5jfVyb_+EZ7Zmi4LXn0Qqg-zxWx4aUNu=bLZyA@mail.gmail.com>
Message-ID: <CABzyjy=tvkxjZvPH5c9DW3aywtruNMU=jjdeR-5rrKg62NibRg@mail.gmail.com>

Use pmap -s -x to look at what's in the virtual address space of your
process and what portion thereof is resident in physical memory.
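
For example (a sketch; on Linux, with <pid> standing in for your java
process id), something like

    pmap -x <pid> | sort -n -k3 | tail -20

lists the mappings with the largest resident set (RSS) -- a quick way to
see where the native memory beyond the Java heap is going.
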
Make sure (in your sizing estimate) to leave enough RAM for:
(1) other active processes in the system besides the JVM
(2) your JVM's heap, perm gen, and code cache
(3) your JVM's thread stacks
(4) direct buffer memory and any other native memory used by your process

-- ramki

On Thu, Mar 1, 2012 at 3:13 AM, Li Li <fancyerii at gmail.com> wrote:

>     I have an application running in a machine with 6 GB memory. And the
> java program is allocated to use 4000M heap by -Xmx4000m and -Xms4000m. But
> we found it use swap memory and system get slower.
> When swapping, the total memory used by the java process is 4.6GB(use top
> res 4.6G and virt 5.1G)
>    I know Direct Buffer used by NIO, PermGen and stack is not limited by
> Xmx. we limit stack 256k and 400 threads in peak. so it will use 80MB
> memory and MaxPermSize is 256MB. MaxDirectMemorySize is default. I think is
> 64MB
> all sumed up is about 3400MB. so JVM itself will need more than 200MB.
>    any one tell me why JVM need more than 200MB memory?(I don't mean jvm
> should not use that much memory). how could I estimate the memory usage of
> JVM itself? thanks.
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120301/d9fcddcd/attachment.html 

From ysr1729 at gmail.com  Thu Mar  1 13:37:12 2012
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Thu, 1 Mar 2012 13:37:12 -0800
Subject: deferred updates? (was Re: value of dense prefix address used?)
Message-ID: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>

Turning off maximal compaction seems to have worked to get rid of the
outlier times, just as we had conjectured.
(Charlie may want to add that to the next edition of his book ;-) We'll see
how well it holds up.

On a related note (hence the changed subject line), addressed mainly to John
Coomes (I think, and maybe Jon Masa?):
why do we not update the pointers inside partial objects at the end of a
destination compaction region at the time
that we copy the object, rather than deferring that work for later? Can
something be done about this?
Secondly, (this would be moot if we didn't have to defer) even if this had
to be deferred, why aren't the deferred updates
done in parallel rather than single-threaded as done currently? Am I
missing something or is this wrinkle merely
an expediency that never got straightened out but is easily fixed? If so,
are there any plans to address this soon?
Certain experiments seem to indicate that this phase is highly variable,
causing parallel old GC times to be variable as well. It would seem at this
point that if this is addressed, the variance/outliers in these times would
shrink considerably.

Question to Charlie: have you seen this trend in any of your performance
explorations?

thanks!
-- ramki

On Sun, Feb 19, 2012 at 2:13 AM, Srinivas Ramakrishna <ysr1729 at gmail.com>wrote:

> Hi Jon --
>
> After looking at the code and the pattern we observed, we are pretty
> confident now that the maximal
> compaction was the root of the problem. We are going to effectively turn
> off the maximal compaction
> and see if it does any harm (don't think it will), and use that to work
> around the problem of extreme degradation
> when doing parallel compaction. It;s interesting why maximal compaction
> would degrade parallel compaction
> by so much... some experiments would be useful and perhaps help correct a
> specific issue the lack of initial parallelism
> may be causing to make the whole collection so much more inefficient.
> Hopefully we'll be able to collect some
> numbers that might help you folks address the issue.
>
> later.
> -- ramki
>
>
> On Fri, Feb 17, 2012 at 12:48 PM, Srinivas Ramakrishna <ysr1729 at gmail.com>wrote:
>
>> Hi John, thanks for those suggestions...
>>
>> So far the pattern has not repeated, but occurred on two different
>> servers (in each case it was the
>> same full gc ordinal too, albeit at different time). There didn't seem
>> anything external that would
>> explain the difference observed. Yes, we'll play around a bit with the
>> compaction related parameters and look at the phase times
>> as well. I am also looking at how the dense prefix address is computed to
>> see if it sheds a bit of
>> light may be, but it could also be something happening early in the life
>> of the process that doesn't happen
>> later that causes this... it's all a bit of a mystery at the moment.
>> Thanks!
>>
>> -- ramki
>>
>>
>> On Fri, Feb 17, 2012 at 12:10 PM, Jon Masamitsu <jon.masamitsu at oracle.com
>> > wrote:
>>
>>> **
>>> Ramki,
>>>
>>> I didn't find a product flag that would print the end of the dense
>>> prefix.
>>> Don't know about jmx.
>>>
>>> The phase accounting (PrintParallelOldGCPhaseTimes)
>>> as you say is a good place to start.  The summary phase is
>>> serial so look for an increase in that phase.   Does this pattern
>>> repeat?
>>>
>>> You could also try changing HeapMaximumCompactionInterval
>>> and see if it affects the pattern.
>>>
>>> Jon
>>>
>>>
>>> On 2/17/2012 9:46 AM, Srinivas Ramakrishna wrote:
>>>
>>> Hi Jo{h}n, all --
>>>
>>> Is there some way i can get at the dense prefix value used for ParOld in
>>> each (major) collection? I couldn't find an obvious product flag for
>>> eliciting that info, but wondered if you knew/remembered.
>>> JMX would be fine too -- as long as the info can be obtained in a product
>>> build.
>>>
>>> I am seeing a curious looking log where one specific major gc seems to have
>>> greater user and real time, lower "parallelism" [=(user+sys)/real] and
>>> takes much longer than the rest of the ParOld's. It
>>> also lowers the footprint a tad more (but definitely not proportionately
>>> more vis-a-vis the user time ratios) than the gc's preceding (but not
>>> succeeding) that one, so one conjecture was that perhaps
>>> something happens with the dense prefix computation at that time and we
>>> somehow end up copying more. We'll see if we can get some data with
>>> printing the ParOld phase times, but i wondered if
>>> we might also be able to elicit the dense prefix address/size. I'll
>>> continue to dig around meanwhile.
>>>
>>> thanks for any info!
>>> -- ramki
>>>
>>>
>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing listhotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>>>
>>> _______________________________________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.net
>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120301/78f03a1f/attachment.html 

From charlie.hunt at oracle.com  Thu Mar  1 14:22:35 2012
From: charlie.hunt at oracle.com (charlie hunt)
Date: Thu, 01 Mar 2012 16:22:35 -0600
Subject: deferred updates? (was Re: value of dense prefix address used?)
In-Reply-To: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
References: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
Message-ID: <4F4FF6AB.5060400@oracle.com>

An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120301/12b6b188/attachment-0001.html 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 3757 bytes
Desc: S/MIME Cryptographic Signature
Url : http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120301/12b6b188/smime-0001.p7s 

From Peter.B.Kessler at Oracle.COM  Thu Mar  1 15:27:48 2012
From: Peter.B.Kessler at Oracle.COM (Peter B. Kessler)
Date: Thu, 01 Mar 2012 15:27:48 -0800
Subject: deferred updates? (was Re: value of dense prefix address used?)
In-Reply-To: <4F4FF6AB.5060400@oracle.com>
References: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
	<4F4FF6AB.5060400@oracle.com>
Message-ID: <4F5005F4.9010404@Oracle.COM>

With regard to ParOld updating pointers in partial objects: IIRC, that had
to do with the fact that the object header might be in the (or maybe a)
preceding region, and that region might not be in place yet, so we just
copy the old bits and then update the split object once all the copying is
finished and we know the fragments of the object are contiguous again.

I think we thought about (a) updating the interior oops that were in the
same region as the object header but decided it wasn't worth it if we were
going to have to do the rest of them later anyway, and (b) keeping a
pointer to the klass someplace while the object was in pieces and having
some kind of oop updater with memory bounds that would do isolated pieces
as they moved into place.  Neither of those seemed worth it.

It seems like a small matter of programming to do the deferred updates in
parallel: they are totally independent, would make a nice work list, etc.
That wouldn't address Jon's conjecture that until "the gap" opens
sufficiently (e.g., in a relatively dense prefix) there won't be much
parallelism available.

			... peter

charlie hunt wrote:
>   Interesting findings Ramki!
> 
> I haven't seen a ParOld with low parallelism as you defined in your 
> original note.
> 
> Perhaps one of the reasons I haven't seen the "low parallelism" is I 
> just haven't run an application long enough to see a large number of 
> ParOld major GCs ?   But, I'll put one in the queue that will do just 
> that. ;-)
> 
> How many major GCs did you observe until you saw the "low parallelism" 
> major GC ?
> 
> Come to think of it, I do have a long running app with ParOld with 
> several major GCs.  But, I'll go double check the GC logs see what's there.
> 
> charlie ...
> 
> On 03/ 1/12 03:37 PM, Srinivas Ramakrishna wrote:
>> Turning off maximal compaction seems to have worked to get rid of the 
>> outlier times, just as we had conjectured.
>> (Charlie may want to add that to the next edition of his book ;-) 
>> We'll see how well it holds up.
>>
>> On a related note (see changed the subject line), addressed mainly to 
>> John Coomes (i think, and may be Jon Masa?),
>> why do we not update the pointers inside partial objects at the end of 
>> a destination compaction region at the time
>> that we copy the object, rather than deferring that work for later? 
>> Can something be done about this?
>> Secondly, (this would be moot if we didn't have to defer) even if this 
>> had to be deferred, why aren't the deferred updates
>> done in parallel rather than single-threaded as done currently? Am I 
>> missing something or is this wrinkle merely
>> an expediency that never got straightened out but is easily fixed? If 
>> so, are there any plans to address this soon?
>> Certain experiments seem to indicate phase to be highly variable 
>> causing parallel old gc's to be variable. It would
>> seem at this point that if this is addressed, the variance/outliers in 
>> these times would shrink considerably.
>>
>> Question to Charlie: have you seen this trend in any of your 
>> performance explorations?
>>
>> thanks!
>> -- ramki
>>
>> On Sun, Feb 19, 2012 at 2:13 AM, Srinivas Ramakrishna 
>> <ysr1729 at gmail.com <mailto:ysr1729 at gmail.com>> wrote:
>>
>>     Hi Jon --
>>
>>     After looking at the code and the pattern we observed, we are
>>     pretty confident now that the maximal
>>     compaction was the root of the problem. We are going to
>>     effectively turn off the maximal compaction
>>     and see if it does any harm (don't think it will), and use that to
>>     work around the problem of extreme degradation
>>     when doing parallel compaction. It;s interesting why maximal
>>     compaction would degrade parallel compaction
>>     by so much... some experiments would be useful and perhaps help
>>     correct a specific issue the lack of initial parallelism
>>     may be causing to make the whole collection so much more
>>     inefficient. Hopefully we'll be able to collect some
>>     numbers that might help you folks address the issue.
>>
>>     later.
>>     -- ramki
>>
>>
>>     On Fri, Feb 17, 2012 at 12:48 PM, Srinivas Ramakrishna
>>     <ysr1729 at gmail.com <mailto:ysr1729 at gmail.com>> wrote:
>>
>>         Hi John, thanks for those suggestions...
>>
>>         So far the pattern has not repeated, but occurred on two
>>         different servers (in each case it was the
>>         same full gc ordinal too, albeit at different time). There
>>         didn't seem anything external that would
>>         explain the difference observed. Yes, we'll play around a bit
>>         with the compaction related parameters and look at the phase times
>>         as well. I am also looking at how the dense prefix address is
>>         computed to see if it sheds a bit of
>>         light may be, but it could also be something happening early
>>         in the life of the process that doesn't happen
>>         later that causes this... it's all a bit of a mystery at the
>>         moment. Thanks!
>>
>>         -- ramki
>>
>>
>>         On Fri, Feb 17, 2012 at 12:10 PM, Jon Masamitsu
>>         <jon.masamitsu at oracle.com <mailto:jon.masamitsu at oracle.com>>
>>         wrote:
>>
>>             Ramki,
>>
>>             I didn't find a product flag that would print the end of
>>             the dense prefix.
>>             Don't know about jmx.
>>
>>             The phase accounting (PrintParallelOldGCPhaseTimes)
>>             as you say is a good place to start.  The summary phase is
>>             serial so look for an increase in that phase.   Does this
>>             pattern
>>             repeat?
>>
>>             You could also try changing HeapMaximumCompactionInterval
>>             and see if it affects the pattern.
>>
>>             Jon
>>
>>
>>             On 2/17/2012 9:46 AM, Srinivas Ramakrishna wrote:
>>>             Hi Jo{h}n, all --
>>>
>>>             Is there some way i can get at the dense prefix value used for ParOld in
>>>             each (major) collection? I couldn't find an obvious product flag for
>>>             eliciting that info, but wondered if you knew/remembered.
>>>             JMX would be fine too -- as long as the info can be obtained in a product
>>>             build.
>>>
>>>             I am seeing a curious looking log where one specific major gc seems to have
>>>             greater user and real time, lower "parallelism" [=(user+sys)/real] and
>>>             takes much longer than the rest of the ParOld's. It
>>>             also lowers the footprint a tad more (but definitely not proportionately
>>>             more vis-a-vis the user time ratios) than the gc's preceding (but not
>>>             succeeding) that one, so one conjecture was that perhaps
>>>             something happens with the dense prefix computation at that time and we
>>>             somehow end up copying more. We'll see if we can get some data with
>>>             printing the ParOld phase times, but i wondered if
>>>             we might also be able to elicit the dense prefix address/size. I'll
>>>             continue to dig around meanwhile.
>>>
>>>             thanks for any info!
>>>             -- ramki
>>>
>>>
>>>
>>>             _______________________________________________
>>>             hotspot-gc-use mailing list
>>>             hotspot-gc-use at openjdk.java.net <mailto:hotspot-gc-use at openjdk.java.net>
>>>             http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>>             _______________________________________________
>>             hotspot-gc-use mailing list
>>             hotspot-gc-use at openjdk.java.net
>>             <mailto:hotspot-gc-use at openjdk.java.net>
>>             http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>>
>>
>>
>>
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From ysr1729 at gmail.com  Thu Mar  1 16:07:19 2012
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Thu, 1 Mar 2012 16:07:19 -0800
Subject: deferred updates? (was Re: value of dense prefix address used?)
In-Reply-To: <4F5005F4.9010404@Oracle.COM>
References: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
	<4F4FF6AB.5060400@oracle.com> <4F5005F4.9010404@Oracle.COM>
Message-ID: <CABzyjymVN45=L-Fpe2Ao57qpV4qoZFBtF-rq0QmVGaxW3D8-nA@mail.gmail.com>

Thanks for that background, Peter; it makes sense! In the short term, then,
parallelizing the deferred updates should cut down the times considerably,
and provide an interim solution until a smarter scheme is implemented to do
the non-initial pieces as you mention in (b) below.

As regards the dense prefix issue, initial data appears to indicate that
the dense prefix calculation is pretty good and that the gap opens up
pretty quickly. (The only time it doesn't is when we insist on maximal
compaction -- which by default happens on the third major cycle and every
20th one thereafter, and which we are currently working around.)
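
A sketch of one way to do that, assuming the two product flags behave as
their names suggest (HeapMaximumCompactionInterval is the flag mentioned
earlier in this thread; HeapFirstMaximumCompactionCount is assumed here to
control the initial "third cycle" case; <N> is just a suitably large
placeholder value):

    -XX:HeapMaximumCompactionInterval=<N> -XX:HeapFirstMaximumCompactionCount=<N>

In effect this pushes the maximal compactions out far enough that they
never happen during the life of the process.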

Charlie, to answer your question, we are still trying to characterize the
incidence of the deferred-update phases and why they only sometimes hit. I
suspect it'll be entirely stochastic, depending on object sizes, their
location and their lifetimes, rather than displaying any specific pattern.

thanks!
-- ramki

On Thu, Mar 1, 2012 at 3:27 PM, Peter B. Kessler <Peter.B.Kessler at oracle.com
> wrote:

> With regard to ParOld updating pointers in partial objects: IIRC, that had
> to do with the fact that the object header might be in the (or maybe a)
> preceding region, and that region might not be in place yet, so we just
> copy the old bits and then update the split object once all the copying is
> finished and we know the fragments of the object are contiguous again.  I
> think we thought about (a) updating the interior oops that were in the same
> region as the object header but decided it wasn't worth it if we were going
> to have to do the rest of them later anyway, and (b) keeping a pointer to
> the klass someplace while the object was in pieces and having some kind of
> oop updater with memory bounds that would do isolated pieces as they moved
> into place.  Neither of those seemed worth it.  It seems like a small
> matter of programming to do the deferred updates in parallel: they are
> totally independent, would make a nice work list, etc..  That wouldn't
> address Jon's conjecture that unti
> l "the gap" opens sufficiently (e.g., in a relatively dense prefix) there
> won't be much parallelism available.
>
>                        ... peter
>
> charlie hunt wrote:
>
>>  Interesting findings Ramki!
>>
>> I haven't seen a ParOld with low parallelism as you defined in your
>> original note.
>>
>> Perhaps one of the reasons I haven't seen the "low parallelism" is I just
>> haven't run an application long enough to see a large number of ParOld
>> major GCs ?   But, I'll put one in the queue that will do just that. ;-)
>>
>> How many major GCs did you observe until you saw the "low parallelism"
>> major GC ?
>>
>> Come to think of it, I do have a long running app with ParOld with
>> several major GCs.  But, I'll go double check the GC logs see what's there.
>>
>> charlie ...
>>
>> On 03/ 1/12 03:37 PM, Srinivas Ramakrishna wrote:
>>
>>> Turning off maximal compaction seems to have worked to get rid of the
>>> outlier times, just as we had conjectured.
>>> (Charlie may want to add that to the next edition of his book ;-) We'll
>>> see how well it holds up.
>>>
>>> On a related note (see changed the subject line), addressed mainly to
>>> John Coomes (i think, and may be Jon Masa?),
>>> why do we not update the pointers inside partial objects at the end of a
>>> destination compaction region at the time
>>> that we copy the object, rather than deferring that work for later? Can
>>> something be done about this?
>>> Secondly, (this would be moot if we didn't have to defer) even if this
>>> had to be deferred, why aren't the deferred updates
>>> done in parallel rather than single-threaded as done currently? Am I
>>> missing something or is this wrinkle merely
>>> an expediency that never got straightened out but is easily fixed? If
>>> so, are there any plans to address this soon?
>>> Certain experiments seem to indicate phase to be highly variable causing
>>> parallel old gc's to be variable. It would
>>> seem at this point that if this is addressed, the variance/outliers in
>>> these times would shrink considerably.
>>>
>>> Question to Charlie: have you seen this trend in any of your performance
>>> explorations?
>>>
>>> thanks!
>>> -- ramki
>>>
>>> On Sun, Feb 19, 2012 at 2:13 AM, Srinivas Ramakrishna <ysr1729 at gmail.com<mailto:
>>> ysr1729 at gmail.com>> wrote:
>>>
>>>    Hi Jon --
>>>
>>>    After looking at the code and the pattern we observed, we are
>>>    pretty confident now that the maximal
>>>    compaction was the root of the problem. We are going to
>>>    effectively turn off the maximal compaction
>>>    and see if it does any harm (don't think it will), and use that to
>>>    work around the problem of extreme degradation
>>>    when doing parallel compaction. It;s interesting why maximal
>>>    compaction would degrade parallel compaction
>>>    by so much... some experiments would be useful and perhaps help
>>>    correct a specific issue the lack of initial parallelism
>>>    may be causing to make the whole collection so much more
>>>    inefficient. Hopefully we'll be able to collect some
>>>    numbers that might help you folks address the issue.
>>>
>>>    later.
>>>    -- ramki
>>>
>>>
>>>    On Fri, Feb 17, 2012 at 12:48 PM, Srinivas Ramakrishna
>>>    <ysr1729 at gmail.com <mailto:ysr1729 at gmail.com>> wrote:
>>>
>>>        Hi John, thanks for those suggestions...
>>>
>>>        So far the pattern has not repeated, but occurred on two
>>>        different servers (in each case it was the
>>>        same full gc ordinal too, albeit at different time). There
>>>        didn't seem anything external that would
>>>        explain the difference observed. Yes, we'll play around a bit
>>>        with the compaction related parameters and look at the phase times
>>>        as well. I am also looking at how the dense prefix address is
>>>        computed to see if it sheds a bit of
>>>        light may be, but it could also be something happening early
>>>        in the life of the process that doesn't happen
>>>        later that causes this... it's all a bit of a mystery at the
>>>        moment. Thanks!
>>>
>>>        -- ramki
>>>
>>>
>>>        On Fri, Feb 17, 2012 at 12:10 PM, Jon Masamitsu
>>>        <jon.masamitsu at oracle.com <mailto:jon.masamitsu at oracle.**com<jon.masamitsu at oracle.com>
>>> >>
>>>
>>>        wrote:
>>>
>>>            Ramki,
>>>
>>>            I didn't find a product flag that would print the end of
>>>            the dense prefix.
>>>            Don't know about jmx.
>>>
>>>            The phase accounting (PrintParallelOldGCPhaseTimes)
>>>            as you say is a good place to start.  The summary phase is
>>>            serial so look for an increase in that phase.   Does this
>>>            pattern
>>>            repeat?
>>>
>>>            You could also try changing HeapMaximumCompactionInterval
>>>            and see if it affects the pattern.
>>>
>>>            Jon
>>>
>>>
>>>            On 2/17/2012 9:46 AM, Srinivas Ramakrishna wrote:
>>>
>>>>            Hi Jo{h}n, all --
>>>>
>>>>            Is there some way i can get at the dense prefix value used
>>>> for ParOld in
>>>>            each (major) collection? I couldn't find an obvious product
>>>> flag for
>>>>            eliciting that info, but wondered if you knew/remembered.
>>>>            JMX would be fine too -- as long as the info can be obtained
>>>> in a product
>>>>            build.
>>>>
>>>>            I am seeing a curious looking log where one specific major
>>>> gc seems to have
>>>>            greater user and real time, lower "parallelism"
>>>> [=(user+sys)/real] and
>>>>            takes much longer than the rest of the ParOld's. It
>>>>            also lowers the footprint a tad more (but definitely not
>>>> proportionately
>>>>            more vis-a-vis the user time ratios) than the gc's preceding
>>>> (but not
>>>>            succeeding) that one, so one conjecture was that perhaps
>>>>            something happens with the dense prefix computation at that
>>>> time and we
>>>>            somehow end up copying more. We'll see if we can get some
>>>> data with
>>>>            printing the ParOld phase times, but i wondered if
>>>>            we might also be able to elicit the dense prefix
>>>> address/size. I'll
>>>>            continue to dig around meanwhile.
>>>>
>>>>            thanks for any info!
>>>>            -- ramki
>>>>
>>>>
>>>>
>>>>            ______________________________**_________________
>>>>            hotspot-gc-use mailing list
>>>>            hotspot-gc-use at openjdk.java.**net<hotspot-gc-use at openjdk.java.net><mailto:
>>>> hotspot-gc-use@**openjdk.java.net <hotspot-gc-use at openjdk.java.net>>
>>>>            http://mail.openjdk.java.net/**mailman/listinfo/hotspot-gc-*
>>>> *use <http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use>
>>>>
>>>
>>>            ______________________________**_________________
>>>            hotspot-gc-use mailing list
>>>            hotspot-gc-use at openjdk.java.**net<hotspot-gc-use at openjdk.java.net>
>>>            <mailto:hotspot-gc-use@**openjdk.java.net<hotspot-gc-use at openjdk.java.net>
>>> >
>>>
>>>            http://mail.openjdk.java.net/**mailman/listinfo/hotspot-gc-**
>>> use <http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use>
>>>
>>>
>>>
>>>
>>>
>>>
>>> ______________________________**_________________
>>> hotspot-gc-use mailing list
>>> hotspot-gc-use at openjdk.java.**net <hotspot-gc-use at openjdk.java.net>
>>> http://mail.openjdk.java.net/**mailman/listinfo/hotspot-gc-**use<http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use>
>>>
>>
>>
>> ------------------------------**------------------------------**
>> ------------
>>
>>
>> ______________________________**_________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.**net <hotspot-gc-use at openjdk.java.net>
>> http://mail.openjdk.java.net/**mailman/listinfo/hotspot-gc-**use<http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120301/228c5c26/attachment-0001.html 

From John.Coomes at oracle.com  Thu Mar  1 18:06:27 2012
From: John.Coomes at oracle.com (John Coomes)
Date: Thu, 1 Mar 2012 18:06:27 -0800
Subject: deferred updates? (was Re: value of dense prefix address used?)
In-Reply-To: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
References: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
Message-ID: <20304.11043.842034.443978@oracle.com>

Srinivas Ramakrishna (ysr1729 at gmail.com) wrote:
> Turning off maximal compaction seems to have worked to get rid of the
> outlier times, just as we had conjectured.
> (Charlie may want to add that to the next edition of his book ;-) We'll see
> how well it holds up.

Interesting.  Maybe we can 'redefine' maximal compaction to be not
quite so maximal -- instead of periodically making the dense prefix
0-sized, just make it smaller than the normal dense prefix.

> On a related note (see changed the subject line), addressed mainly to John
> Coomes (i think, and may be Jon Masa?),
> why do we not update the pointers inside partial objects at the end of a
> destination compaction region at the time
> that we copy the object, rather than deferring that work for later? Can
> something be done about this?

I honestly don't recall why.  It may be as simple as that it will help
a little bit, but likely not too much, because the tail of the object
that extends onto subsequent regions still has to be deferred.
Certainly something can be done, but see below.

> Secondly, (this would be moot if we didn't have to defer) even if this had
> to be deferred, why aren't the deferred updates
> done in parallel rather than single-threaded as done currently? Am I
> missing something or is this wrinkle merely
> an expediency that never got straightened out but is easily fixed? If so,
> are there any plans to address this soon?

Doing the updates in parallel is one option, and easy to do.  Given
time to work on it, I'd actually prefer to eliminate the deferred
updates altogether, and instead handle them during the compaction
itself (more effort, but not a lot more).

-John

> Certain experiments seem to indicate phase to be highly variable causing
> parallel old gc's to be variable. It would
> seem at this point that if this is addressed, the variance/outliers in
> these times would shrink considerably.
> 
> Question to Charlie: have you seen this trend in any of your performance
> explorations?
> 
> thanks!
> -- ramki
> 
> On Sun, Feb 19, 2012 at 2:13 AM, Srinivas Ramakrishna <ysr1729 at gmail.com>wrote:
> 
> > Hi Jon --
> >
> > After looking at the code and the pattern we observed, we are pretty
> > confident now that the maximal
> > compaction was the root of the problem. We are going to effectively turn
> > off the maximal compaction
> > and see if it does any harm (don't think it will), and use that to work
> > around the problem of extreme degradation
> > when doing parallel compaction. It;s interesting why maximal compaction
> > would degrade parallel compaction
> > by so much... some experiments would be useful and perhaps help correct a
> > specific issue the lack of initial parallelism
> > may be causing to make the whole collection so much more inefficient.
> > Hopefully we'll be able to collect some
> > numbers that might help you folks address the issue.
> >
> > later.
> > -- ramki
> >
> >
> > On Fri, Feb 17, 2012 at 12:48 PM, Srinivas Ramakrishna <ysr1729 at gmail.com>wrote:
> >
> >> Hi John, thanks for those suggestions...
> >>
> >> So far the pattern has not repeated, but occurred on two different
> >> servers (in each case it was the
> >> same full gc ordinal too, albeit at different time). There didn't seem
> >> anything external that would
> >> explain the difference observed. Yes, we'll play around a bit with the
> >> compaction related parameters and look at the phase times
> >> as well. I am also looking at how the dense prefix address is computed to
> >> see if it sheds a bit of
> >> light may be, but it could also be something happening early in the life
> >> of the process that doesn't happen
> >> later that causes this... it's all a bit of a mystery at the moment.
> >> Thanks!
> >>
> >> -- ramki
> >>
> >>
> >> On Fri, Feb 17, 2012 at 12:10 PM, Jon Masamitsu <jon.masamitsu at oracle.com
> >> > wrote:
> >>
> >>> **
> >>> Ramki,
> >>>
> >>> I didn't find a product flag that would print the end of the dense
> >>> prefix.
> >>> Don't know about jmx.
> >>>
> >>> The phase accounting (PrintParallelOldGCPhaseTimes)
> >>> as you say is a good place to start.  The summary phase is
> >>> serial so look for an increase in that phase.   Does this pattern
> >>> repeat?
> >>>
> >>> You could also try changing HeapMaximumCompactionInterval
> >>> and see if it affects the pattern.
> >>>
> >>> Jon
> >>>
> >>>
> >>> On 2/17/2012 9:46 AM, Srinivas Ramakrishna wrote:
> >>>
> >>> Hi Jo{h}n, all --
> >>>
> >>> Is there some way i can get at the dense prefix value used for ParOld in
> >>> each (major) collection? I couldn't find an obvious product flag for
> >>> eliciting that info, but wondered if you knew/remembered.
> >>> JMX would be fine too -- as long as the info can be obtained in a product
> >>> build.
> >>>
> >>> I am seeing a curious looking log where one specific major gc seems to have
> >>> greater user and real time, lower "parallelism" [=(user+sys)/real] and
> >>> takes much longer than the rest of the ParOld's. It
> >>> also lowers the footprint a tad more (but definitely not proportionately
> >>> more vis-a-vis the user time ratios) than the gc's preceding (but not
> >>> succeeding) that one, so one conjecture was that perhaps
> >>> something happens with the dense prefix computation at that time and we
> >>> somehow end up copying more. We'll see if we can get some data with
> >>> printing the ParOld phase times, but i wondered if
> >>> we might also be able to elicit the dense prefix address/size. I'll
> >>> continue to dig around meanwhile.
> >>>
> >>> thanks for any info!
> >>> -- ramki
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> hotspot-gc-use mailing listhotspot-gc-use at openjdk.java.nethttp://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> >>>
> >>>
> >>> _______________________________________________
> >>> hotspot-gc-use mailing list
> >>> hotspot-gc-use at openjdk.java.net
> >>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
> >>>
> >>>
> >>
> >
> 
> ----------------------------------------------------------------------
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From the.6th.month at gmail.com  Fri Mar  2 06:47:26 2012
From: the.6th.month at gmail.com (the.6th.month at gmail.com)
Date: Fri, 2 Mar 2012 22:47:26 +0800
Subject: jvm swap issue
Message-ID: <CAKzy53mddC1N2SLQaZACEWZkvRkbREH-UmiuCjWvwBmZtYuzrQ@mail.gmail.com>

Hi:
I've just come across a weird situation where the JVM occasionally eats
into swap space even when there is free memory available. Has anyone got
any idea about how to diagnose such a problem?
The output of top is as follows, sorted by memory usage (shift+M):

top - 22:36:16 up 102 days, 10:49,  2 users,  load average: 1.68, 1.39, 1.34
Tasks:  98 total,   2 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s):  3.9%us,  0.6%sy,  0.0%ni, 94.4%id,  0.0%wa,  0.0%hi,  0.3%si,
0.8%st
Mem:   6291456k total,  6276292k used,    15164k free,    16528k buffers
Swap:  4192924k total,    39264k used,  4153660k free,   836288k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
COMMAND

  677 root      18   0 5259m 4.8g  10m S 18.9 79.5 398:10.95
java

 1721 root      34  19  253m 5356 2196 S  0.0  0.1   0:07.81
yum-updatesd

 1521 ntp       15   0 23412 5044 3916 S  0.0  0.1   0:06.03
ntpd

 1482 root      15   0  154m 4512 3000 S  0.0  0.1   1:31.02
snmpd

14006 root      17   0 88080 3264 2552 S  0.0  0.1   0:00.01
sshd

14053 root      18   0 88080 3264 2552 S  0.0  0.1   0:00.00
sshd

13088 postfix   15   0 54244 2300 1796 S  0.0  0.0   0:00.00
pickup

 1592 postfix   15   0 54420 1936 1816 S  0.0  0.0   0:00.09
qmgr

 1580 root      15   0 54180 1828 1736 S  0.0  0.0   0:00.50
master

14008 yue.liu   15   0 88080 1716  976 R  0.0  0.0   0:00.04
sshd

14055 yue.liu   15   0 88080 1696  972 S  0.0  0.0   0:00.01
sshd

14056 yue.liu   15   0 66096 1596 1204 S  0.0  0.0   0:00.01
bash

14159 root      15   0 66096 1580 1192 S  0.0  0.0   0:00.00
bash

14101 root      15   0 66096 1576 1196 S  0.0  0.0   0:00.01
bash

14009 yue.liu   15   0 66096 1572 1184 S  0.0  0.0   0:00.01
bash

 1411 haldaemo  15   0 30660 1252 1124 S  0.0  0.0   0:00.14 hald

and the swap usage seems to come from the address space of:
2aaad4000000-2aaad7ffc000 rwxp 2aaad4000000 00:00 0
Size:             65520 kB
Rss:              50476 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    50476 kB
Swap:             10192 kB
Pss:              50476 kB
according to the output of /proc/pid/smaps.
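
(In case it is useful: the total swapped-out size across all of the
process's mappings can be summed from smaps with something like

    awk '/^Swap:/ {s += $2} END {print s " kB"}' /proc/<pid>/smaps

where <pid> is the java process id, 677 in the output above.)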

the output of jmap is as below:
[root at l-tw14 ~]# jmap 677
Attaching to process ID 677, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.1-b02
0x0000000040000000    49K    /home/q/java/jdk1.6.0_26/bin/java
0x0000003bf5600000    136K    /lib64/ld-2.5.so
0x0000003bf5a00000    1681K    /lib64/libc-2.5.so
0x0000003bf5e00000    22K    /lib64/libdl-2.5.so
0x0000003bf6200000    142K    /lib64/libpthread-2.5.so
0x0000003bf6600000    600K    /lib64/libm-2.5.so
0x0000003bf7200000    52K    /lib64/librt-2.5.so
0x0000003bf8600000    111K    /lib64/libnsl-2.5.so
0x0000003bfbe00000    90K    /lib64/libresolv-2.5.so
0x00002aaaaaab6000    65K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libverify.so
0x00002aaaaabc5000    229K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libjava.so
0x00002aaaaad00000    52K    /lib64/libnss_files-2.5.so
0x00002aaaaaf0b000    90K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libzip.so
0x00002aaab2dcc000    38K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libmanagement.so
0x00002aaab2ed3000    110K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libnet.so
0x00002aaab3504000    23K    /lib64/libnss_dns-2.5.so
0x00002aaab3709000    43K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libnio.so
0x00002aaab3818000    744K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libawt.so
0x00002aaab39e7000    33K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/headless/libmawt.so
0x00002aaab3aed000    655K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libfontmanager.so
0x00002aaabc2fc000    718K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libmlib_image.so
0x00002aaabc4ab000    221K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/libjpeg.so
0x00002ac6d0e04000    47K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/jli/libjli.so
0x00002ac6d0f11000    13027K
/home/q/java/jdk1.6.0_26/jre/lib/amd64/server/libjvm.so

[root at l-tw14 ~]# jmap -heap 677
Attaching to process ID 677, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.1-b02

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 4194304000 (4000.0MB)
   NewSize          = 268435456 (256.0MB)
   MaxNewSize       = 268435456 (256.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 268435456 (256.0MB)
   MaxPermSize      = 268435456 (256.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 240779264 (229.625MB)
   used     = 42971864 (40.981163024902344MB)
   free     = 197807400 (188.64383697509766MB)
   17.84699532929879% used
From Space:
   capacity = 1769472 (1.6875MB)
   used     = 1736704 (1.65625MB)
   free     = 32768 (0.03125MB)
   98.14814814814815% used
To Space:
   capacity = 13893632 (13.25MB)
   used     = 0 (0.0MB)
   free     = 13893632 (13.25MB)
   0.0% used
PS Old Generation
   capacity = 3925868544 (3744.0MB)
   used     = 3600153552 (3433.373977661133MB)
   free     = 325714992 (310.6260223388672MB)
   91.70336478795761% used
PS Perm Generation
   capacity = 268435456 (256.0MB)
   used     = 191475048 (182.6048355102539MB)
   free     = 76960408 (73.3951644897461MB)
   71.33001387119293% used

It seems that the heap usage is not high enough to push the process into
swap, which is weird. I also checked the NIO direct-memory usage, and it
was quite trivial:
[root at l-tw14 ~]#java -classpath .:$JAVA_HOME/lib/sa-jdi.jar
DirectMemorySize `pgrep java`
Attaching to process ID 677, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.1-b02
NIO direct memory:
 reserved size = 0.357348MB (374707 bytes)
 max size = 3968.000000MB (4160749568 bytes)

This problem has bothered me for quite a few days, and I would appreciate
any help. Thanks in advance.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120302/6ac7c9e2/attachment-0001.html 

From jon.masamitsu at oracle.com  Fri Mar  2 10:31:11 2012
From: jon.masamitsu at oracle.com (Jon Masamitsu)
Date: Fri, 02 Mar 2012 10:31:11 -0800
Subject: deferred updates? (was Re: value of dense prefix address used?)
In-Reply-To: <20304.11043.842034.443978@oracle.com>
References: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
	<20304.11043.842034.443978@oracle.com>
Message-ID: <4F5111EF.4030209@oracle.com>



On 3/1/2012 6:06 PM, John Coomes wrote:
> ...
>> On a related note (see changed the subject line), addressed mainly to John
>> Coomes (i think, and may be Jon Masa?),
>> why do we not update the pointers inside partial objects at the end of a
>> destination compaction region at the time
>> that we copy the object, rather than deferring that work for later? Can
>> something be done about this?
> I honestly don't recall why.  It may be as simple as that it will help
> a little bit, but likely not too much, because the tail of the object
> that extends onto subsequent regions still has to be deferred.
> Certainly something can be done, but see below.

It was implemented with the deferred update because that was safe and
simple.  I battled with updating objects split over regions until I ran
out of time.  I believe the policy is "if this object extends into a
region to the left, defer the update".  With a little more work the policy
could be "if the klass word is in a region to the left, defer the update".

From jon.masamitsu at oracle.com  Fri Mar  2 10:41:17 2012
From: jon.masamitsu at oracle.com (Jon Masamitsu)
Date: Fri, 02 Mar 2012 10:41:17 -0800
Subject: the memory usage of hotspot vm itself
In-Reply-To: <CAFAd71UQqLnr5jfVyb_+EZ7Zmi4LXn0Qqg-zxWx4aUNu=bLZyA@mail.gmail.com>
References: <CAFAd71UQqLnr5jfVyb_+EZ7Zmi4LXn0Qqg-zxWx4aUNu=bLZyA@mail.gmail.com>
Message-ID: <4F51144D.7040704@oracle.com>

For GC, the VM maintains data structures whose size is proportional to the
size of the heap.  For example, to facilitate young gen collections there
is a card table, which is used to find references from the old generation
to the young generation, and whose size is proportional to the size of the
heap.  Depending on the collector in use, there are also bitmaps that are
proportional to the size of the heap.
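
As a rough, hedged illustration (the exact ratios depend on the collector
and the VM version): with a card size of 512 bytes, a 4000 MB heap implies
a card table of about 4000 MB / 512 = ~8 MB, and a marking bitmap kept at
one bit per 64-bit heap word comes to about 4000 MB / 64 = ~62 MB; a
collector that keeps more than one such bitmap needs correspondingly more.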

Is that what you are asking?

On 3/1/2012 3:13 AM, Li Li wrote:
>      I have an application running in a machine with 6 GB memory. And the
> java program is allocated to use 4000M heap by -Xmx4000m and -Xms4000m. But
> we found it use swap memory and system get slower.
> When swapping, the total memory used by the java process is 4.6GB(use top
> res 4.6G and virt 5.1G)
>     I know Direct Buffer used by NIO, PermGen and stack is not limited by
> Xmx. we limit stack 256k and 400 threads in peak. so it will use 80MB
> memory and MaxPermSize is 256MB. MaxDirectMemorySize is default. I think is
> 64MB
> all sumed up is about 3400MB. so JVM itself will need more than 200MB.
>     any one tell me why JVM need more than 200MB memory?(I don't mean jvm
> should not use that much memory). how could I estimate the memory usage of
> JVM itself? thanks.
>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120302/8c255d2f/attachment.html 

From jamesnichols3 at gmail.com  Fri Mar  2 14:31:39 2012
From: jamesnichols3 at gmail.com (James Nichols)
Date: Fri, 2 Mar 2012 17:31:39 -0500
Subject: jvm swap issue
In-Reply-To: <CAKzy53mddC1N2SLQaZACEWZkvRkbREH-UmiuCjWvwBmZtYuzrQ@mail.gmail.com>
References: <CAKzy53mddC1N2SLQaZACEWZkvRkbREH-UmiuCjWvwBmZtYuzrQ@mail.gmail.com>
Message-ID: <CALqYbeN-XJHcyVftajJRJ6L2pOTYVDxsBxBY4nncF_umUXZcTA@mail.gmail.com>

What is the OS?

If it is Linux, the vm.swappiness = 0 option may help.

I've seen this situation before where the Linux OS will proactively swap
out areas of memory that haven't been used in a while.  With a large heap
and infrequent full garbage collections, the Linux virtual memory subsystem
will sometimes do this to pages underneath the JVM's OS-level virtual
memory.  If you then have a full garbage collection it will usually cause
major problems if any part of the heap is on swap space.

Typically what I do is set swappiness to 0, but not disable swap, and have
an alert go off on the OS level if it uses even a single block of swap
space.  That way there is at least a fail safe, since disabling swap can
make the system crash if it does run out of physical memory.
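
A minimal sketch of that setup on a typical Linux box (file locations may
vary by distribution):

    sysctl -w vm.swappiness=0                      # takes effect immediately
    echo "vm.swappiness = 0" >> /etc/sysctl.conf   # persists across reboots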

There are options in top that will show you how much of a process's memory
is on swap, as well as its page faults.  You don't want either.
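
(Another hedged way to check, outside of top, assuming a kernel new enough
to expose the VmSwap field:

    grep VmSwap /proc/<pid>/status

reports how much of that process is currently swapped out; on older
kernels, summing the Swap: lines in /proc/<pid>/smaps gives the same
number.)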

Hope that helps.

Jim




On Fri, Mar 2, 2012 at 9:47 AM, the.6th.month at gmail.com <
the.6th.month at gmail.com> wrote:

> hi,hi:
> I've just come across a weird situation where JVM ate swap space
> occasionally even when there's free memory available. Has anyone got any
> idea about how to diagnose such problem?
> The output of top is as follows, and it is sorted by memory usage
> (shift+M):
>
> top - 22:36:16 up 102 days, 10:49,  2 users,  load average: 1.68, 1.39,
> 1.34
> Tasks:  98 total,   2 running,  96 sleeping,   0 stopped,   0 zombie
> Cpu(s):  3.9%us,  0.6%sy,  0.0%ni, 94.4%id,  0.0%wa,  0.0%hi,  0.3%si,
> 0.8%st
> Mem:   6291456k total,  6276292k used,    15164k free,    16528k buffers
> Swap:  4192924k total,    39264k used,  4153660k free,   836288k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> COMMAND
>
>   677 root      18   0 5259m 4.8g  10m S 18.9 79.5 398:10.95
> java
>
>  1721 root      34  19  253m 5356 2196 S  0.0  0.1   0:07.81
> yum-updatesd
>
>  1521 ntp       15   0 23412 5044 3916 S  0.0  0.1   0:06.03
> ntpd
>
>  1482 root      15   0  154m 4512 3000 S  0.0  0.1   1:31.02
> snmpd
>
> 14006 root      17   0 88080 3264 2552 S  0.0  0.1   0:00.01
> sshd
>
> 14053 root      18   0 88080 3264 2552 S  0.0  0.1   0:00.00
> sshd
>
> 13088 postfix   15   0 54244 2300 1796 S  0.0  0.0   0:00.00
> pickup
>
>  1592 postfix   15   0 54420 1936 1816 S  0.0  0.0   0:00.09
> qmgr
>
>  1580 root      15   0 54180 1828 1736 S  0.0  0.0   0:00.50
> master
>
> 14008 yue.liu   15   0 88080 1716  976 R  0.0  0.0   0:00.04
> sshd
>
> 14055 yue.liu   15   0 88080 1696  972 S  0.0  0.0   0:00.01
> sshd
>
> 14056 yue.liu   15   0 66096 1596 1204 S  0.0  0.0   0:00.01
> bash
>
> 14159 root      15   0 66096 1580 1192 S  0.0  0.0   0:00.00
> bash
>
> 14101 root      15   0 66096 1576 1196 S  0.0  0.0   0:00.01
> bash
>
> 14009 yue.liu   15   0 66096 1572 1184 S  0.0  0.0   0:00.01
> bash
>
>  1411 haldaemo  15   0 30660 1252 1124 S  0.0  0.0   0:00.14 hald
>
> and the swap usage seems to come from the address space of:
> 2aaad4000000-2aaad7ffc000 rwxp 2aaad4000000 00:00 0
> Size:             65520 kB
> Rss:              50476 kB
> Shared_Clean:         0 kB
> Shared_Dirty:         0 kB
> Private_Clean:        0 kB
> Private_Dirty:    50476 kB
> Swap:             10192 kB
> Pss:              50476 kB
> according to the output of /proc/pid/smaps.
>
> the output of jmap is as below:
> [root at l-tw14 ~]# jmap 677
> Attaching to process ID 677, please wait...
> Debugger attached successfully.
> Server compiler detected.
> JVM version is 20.1-b02
> 0x0000000040000000    49K    /home/q/java/jdk1.6.0_26/bin/java
> 0x0000003bf5600000    136K    /lib64/ld-2.5.so
> 0x0000003bf5a00000    1681K    /lib64/libc-2.5.so
> 0x0000003bf5e00000    22K    /lib64/libdl-2.5.so
> 0x0000003bf6200000    142K    /lib64/libpthread-2.5.so
> 0x0000003bf6600000    600K    /lib64/libm-2.5.so
> 0x0000003bf7200000    52K    /lib64/librt-2.5.so
> 0x0000003bf8600000    111K    /lib64/libnsl-2.5.so
> 0x0000003bfbe00000    90K    /lib64/libresolv-2.5.so
> 0x00002aaaaaab6000    65K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libverify.so
> 0x00002aaaaabc5000    229K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libjava.so
> 0x00002aaaaad00000    52K    /lib64/libnss_files-2.5.so
> 0x00002aaaaaf0b000    90K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libzip.so
> 0x00002aaab2dcc000    38K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libmanagement.so
> 0x00002aaab2ed3000    110K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libnet.so
> 0x00002aaab3504000    23K    /lib64/libnss_dns-2.5.so
> 0x00002aaab3709000    43K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libnio.so
> 0x00002aaab3818000    744K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libawt.so
> 0x00002aaab39e7000    33K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/headless/libmawt.so
> 0x00002aaab3aed000    655K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libfontmanager.so
> 0x00002aaabc2fc000    718K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libmlib_image.so
> 0x00002aaabc4ab000    221K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libjpeg.so
> 0x00002ac6d0e04000    47K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/jli/libjli.so
> 0x00002ac6d0f11000    13027K
> /home/q/java/jdk1.6.0_26/jre/lib/amd64/server/libjvm.so
>
> [root at l-tw14 ~]# jmap -heap 677
> Attaching to process ID 677, please wait...
> Debugger attached successfully.
> Server compiler detected.
> JVM version is 20.1-b02
>
> using thread-local object allocation.
> Parallel GC with 4 thread(s)
>
> Heap Configuration:
>    MinHeapFreeRatio = 40
>    MaxHeapFreeRatio = 70
>    MaxHeapSize      = 4194304000 (4000.0MB)
>    NewSize          = 268435456 (256.0MB)
>    MaxNewSize       = 268435456 (256.0MB)
>    OldSize          = 5439488 (5.1875MB)
>    NewRatio         = 2
>    SurvivorRatio    = 8
>    PermSize         = 268435456 (256.0MB)
>    MaxPermSize      = 268435456 (256.0MB)
>
> Heap Usage:
> PS Young Generation
> Eden Space:
>    capacity = 240779264 (229.625MB)
>    used     = 42971864 (40.981163024902344MB)
>    free     = 197807400 (188.64383697509766MB)
>    17.84699532929879% used
> From Space:
>    capacity = 1769472 (1.6875MB)
>    used     = 1736704 (1.65625MB)
>    free     = 32768 (0.03125MB)
>    98.14814814814815% used
> To Space:
>    capacity = 13893632 (13.25MB)
>    used     = 0 (0.0MB)
>    free     = 13893632 (13.25MB)
>    0.0% used
> PS Old Generation
>    capacity = 3925868544 (3744.0MB)
>    used     = 3600153552 (3433.373977661133MB)
>    free     = 325714992 (310.6260223388672MB)
>    91.70336478795761% used
> PS Perm Generation
>    capacity = 268435456 (256.0MB)
>    used     = 191475048 (182.6048355102539MB)
>    free     = 76960408 (73.3951644897461MB)
>    71.33001387119293% used
>
> it seems that the heap usage is not that high enough to occupy swap space.
> weird. And I also checked the nio usage, it was quite trivial:
> [root at l-tw14 ~]#java -classpath .:$JAVA_HOME/lib/sa-jdi.jar
> DirectMemorySize `pgrep java`
> Attaching to process ID 677, please wait...
> Debugger attached successfully.
> Server compiler detected.
> JVM version is 20.1-b02
> NIO direct memory:
>  reserved size = 0.357348MB (374707 bytes)
>  max size = 3968.000000MB (4160749568 bytes)
>
> This problem bothered me quite a few days, and I appreciate all your help.
> Thanks in advance
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120302/08e29137/attachment.html 

From ysr1729 at gmail.com  Fri Mar  2 14:54:03 2012
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Fri, 2 Mar 2012 14:54:03 -0800
Subject: deferred updates? (was Re: value of dense prefix address used?)
In-Reply-To: <4F5111EF.4030209@oracle.com>
References: <CABzyjy=zPMGCHPC4ovAOdC_X5tBQ1DhW2Ofk=DJr5wN7ZE3Rww@mail.gmail.com>
	<20304.11043.842034.443978@oracle.com>
	<4F5111EF.4030209@oracle.com>
Message-ID: <CABzyjy=UWGZWKtFPY_b2DucfpVzEwgyZi2CL31X01NZvkSi+5Q@mail.gmail.com>

Hi Peter, John, Jon -- Thanks!

How about this: in the very short term, as Peter suggested, and because it
is relatively simple, do the deferred updates in parallel, with all the
worker threads claiming the deferred objects in a "strided" fashion; for
the longer term, try to eliminate the phase altogether by merging it into
the compaction phase, leaving enough information in the summary table to
deal with each fragment by itself and having an appropriate interval-aware
oop updater. Or do you feel, John, that the latter is not that hard and one
might as well do that rather than the intermediate parallelization (which
would be thrown away once the separate phase is done away with)? Thinking
a bit more about it, it does appear as though the latter may be the way to
go. Would it be possible for you to file a CR so we have a "handle" by
which to refer to this issue, and so it does not fall through the cracks?
I am happy to help in any way I can; just let me know.

thanks!
-- ramki
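
For concreteness, a minimal, generic sketch of what "strided" claiming of the
deferred objects could look like (plain Java for illustration only, not
HotSpot's actual worker code; the list of pending updates is just a stand-in):

    import java.util.List;

    // Illustrative only: worker k visits items k, k+N, k+2N, ...
    // so the list of deferred objects is partitioned with no synchronization.
    class StridedDeferredUpdateWorker implements Runnable {
        private final List<Runnable> deferredUpdates; // stand-in for the deferred objects
        private final int workerId;
        private final int workerCount;

        StridedDeferredUpdateWorker(List<Runnable> deferredUpdates, int workerId, int workerCount) {
            this.deferredUpdates = deferredUpdates;
            this.workerId = workerId;
            this.workerCount = workerCount;
        }

        public void run() {
            for (int i = workerId; i < deferredUpdates.size(); i += workerCount) {
                deferredUpdates.get(i).run(); // apply this object's deferred pointer updates
            }
        }
    }

The point is simply that the deferred-object list divides among the workers
without coordination; the real implementation would of course walk the
collector's own data structures.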

On Fri, Mar 2, 2012 at 10:31 AM, Jon Masamitsu <jon.masamitsu at oracle.com>wrote:

>
>
> On 3/1/2012 6:06 PM, John Coomes wrote:
>
>> ...
>>
>>  On a related note (see changed the subject line), addressed mainly to
>>> John
>>> Coomes (i think, and may be Jon Masa?),
>>> why do we not update the pointers inside partial objects at the end of a
>>> destination compaction region at the time
>>> that we copy the object, rather than deferring that work for later? Can
>>> something be done about this?
>>>
>> I honestly don't recall why.  It may be as simple as that it will help
>> a little bit, but likely not too much, because the tail of the object
>> that extends onto subsequent regions still has to be deferred.
>> Certainly something can be done, but see below.
>>
>
> It was implemented with the deferred update because that was safe and
> simple.  I battled with
> updating objects split over regions until I ran out of time.  I
> believe the policy is "if this
> object extends into a region to the left, defer the update".   With a
> little more work the policy
> could be "if the klass word is in a region to the left, defer the update".
>
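
Read literally, the two policies differ only in which address gets compared
against the start of the destination region being filled. A tiny sketch of that
comparison (plain Java for illustration, not HotSpot source; treating the klass
word as the second header word is an assumption):

    // Illustrative only. destRegionStart is the start address of the destination
    // region currently being filled; objStart is the new start address of the
    // object just copied into it.
    class DeferPolicies {
        // current policy: defer if the object extends into a region to the left,
        // i.e. it starts before the current destination region
        static boolean deferCurrent(long objStart, long destRegionStart) {
            return objStart < destRegionStart;
        }

        // proposed policy: defer only if the klass word itself lies in a region
        // to the left (one-word klass offset assumed for illustration)
        static boolean deferIfKlassToTheLeft(long objStart, long destRegionStart, int bytesPerWord) {
            return objStart + bytesPerWord < destRegionStart;
        }
    }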
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120302/31f35c6f/attachment-0001.html 

From the.6th.month at gmail.com  Mon Mar  5 01:44:06 2012
From: the.6th.month at gmail.com (the.6th.month at gmail.com)
Date: Mon, 5 Mar 2012 17:44:06 +0800
Subject: jvm swap issue
In-Reply-To: <CALqYbeN-XJHcyVftajJRJ6L2pOTYVDxsBxBY4nncF_umUXZcTA@mail.gmail.com>
References: <CAKzy53mddC1N2SLQaZACEWZkvRkbREH-UmiuCjWvwBmZtYuzrQ@mail.gmail.com>
	<CALqYbeN-XJHcyVftajJRJ6L2pOTYVDxsBxBY4nncF_umUXZcTA@mail.gmail.com>
Message-ID: <CAKzy53na=ehDSXH4G5yevDYYnSbkQvyjoAVr3B8wE-gxxY4jZw@mail.gmail.com>

Hi, James:

I've set swappiness to zero. We'll keep an eye on it for a while to see
whether it is the solution to the problem. So far, so good.

Best Regards,
Leon
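
As a cheap way of keeping that eye on it, a minimal sketch along these lines
(assuming a Linux kernel new enough to report VmSwap in /proc/<pid>/status;
older kernels would need the Swap: lines of smaps instead) could back the kind
of alert James describes below:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class SwapCheck {
        public static void main(String[] args) throws IOException {
            String pid = (args.length > 0) ? args[0] : "self";
            // VmSwap is only reported on kernels >= 2.6.34; on older kernels,
            // sum the "Swap:" lines of /proc/<pid>/smaps instead.
            BufferedReader in = new BufferedReader(new FileReader("/proc/" + pid + "/status"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("VmRSS:") || line.startsWith("VmSwap:")) {
                        System.out.println(line);
                    }
                }
            } finally {
                in.close();
            }
        }
    }

Running it periodically against the JVM's pid and alerting on a non-zero VmSwap
gives the fail-safe without disabling swap.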

On 3 March 2012 06:31, James Nichols <jamesnichols3 at gmail.com> wrote:

> What is the OS?
>
> If it is Linux, the vm.swappiness = 0 option may help.
>
> I've seen this situation before where the Linux OS will proactively swap
> out areas of memory that haven't been used in a while.  With a large heap
> and infrequent full garbage collections, the Linux virtual memory subsystem
> will sometimes do this to pages underneath the JVM's OS-level virtual
> memory.  If you then have a full garbage collection it will usually cause
> major problems if any part of the heap is on swap space.
>
> Typically what I do is set swappiness to 0, but not disable swap, and have
> an alert go off on the OS level if it uses even a single block of swap
> space.  That way there is at least a fail safe, since disabling swap can
> make the system crash if it does run out of physical memory.
>
> There are options to use Top that will show you how much of a process's
> memory space is on swap, as well as page faults.  You don't want either.
>
> Hope that helps.
>
> Jim
>
>
>
>
>  On Fri, Mar 2, 2012 at 9:47 AM, the.6th.month at gmail.com <
> the.6th.month at gmail.com> wrote:
>
>> hi,hi:
>> I've just come across a weird situation where JVM ate swap space
>> occasionally even when there's free memory available. Has anyone got any
>> idea about how to diagnose such a problem?
>> The output of top is as follows, and it is sorted by memory usage
>> (shift+M):
>>
>> top - 22:36:16 up 102 days, 10:49,  2 users,  load average: 1.68, 1.39,
>> 1.34
>> Tasks:  98 total,   2 running,  96 sleeping,   0 stopped,   0 zombie
>> Cpu(s):  3.9%us,  0.6%sy,  0.0%ni, 94.4%id,  0.0%wa,  0.0%hi,  0.3%si,
>> 0.8%st
>> Mem:   6291456k total,  6276292k used,    15164k free,    16528k buffers
>> Swap:  4192924k total,    39264k used,  4153660k free,   836288k cached
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>> COMMAND
>>
>>   677 root      18   0 5259m 4.8g  10m S 18.9 79.5 398:10.95
>> java
>>
>>  1721 root      34  19  253m 5356 2196 S  0.0  0.1   0:07.81
>> yum-updatesd
>>
>>  1521 ntp       15   0 23412 5044 3916 S  0.0  0.1   0:06.03
>> ntpd
>>
>>  1482 root      15   0  154m 4512 3000 S  0.0  0.1   1:31.02
>> snmpd
>>
>> 14006 root      17   0 88080 3264 2552 S  0.0  0.1   0:00.01
>> sshd
>>
>> 14053 root      18   0 88080 3264 2552 S  0.0  0.1   0:00.00
>> sshd
>>
>> 13088 postfix   15   0 54244 2300 1796 S  0.0  0.0   0:00.00
>> pickup
>>
>>  1592 postfix   15   0 54420 1936 1816 S  0.0  0.0   0:00.09
>> qmgr
>>
>>  1580 root      15   0 54180 1828 1736 S  0.0  0.0   0:00.50
>> master
>>
>> 14008 yue.liu   15   0 88080 1716  976 R  0.0  0.0   0:00.04
>> sshd
>>
>> 14055 yue.liu   15   0 88080 1696  972 S  0.0  0.0   0:00.01
>> sshd
>>
>> 14056 yue.liu   15   0 66096 1596 1204 S  0.0  0.0   0:00.01
>> bash
>>
>> 14159 root      15   0 66096 1580 1192 S  0.0  0.0   0:00.00
>> bash
>>
>> 14101 root      15   0 66096 1576 1196 S  0.0  0.0   0:00.01
>> bash
>>
>> 14009 yue.liu   15   0 66096 1572 1184 S  0.0  0.0   0:00.01
>> bash
>>
>>  1411 haldaemo  15   0 30660 1252 1124 S  0.0  0.0   0:00.14 hald
>>
>> and the swap usage seems to come from the address space of:
>> 2aaad4000000-2aaad7ffc000 rwxp 2aaad4000000 00:00 0
>> Size:             65520 kB
>> Rss:              50476 kB
>> Shared_Clean:         0 kB
>> Shared_Dirty:         0 kB
>> Private_Clean:        0 kB
>> Private_Dirty:    50476 kB
>> Swap:             10192 kB
>> Pss:              50476 kB
>> according to the output of /proc/pid/smaps.
>>
>> the output of jmap is as below:
>> [root at l-tw14 ~]# jmap 677
>> Attaching to process ID 677, please wait...
>> Debugger attached successfully.
>> Server compiler detected.
>> JVM version is 20.1-b02
>> 0x0000000040000000    49K    /home/q/java/jdk1.6.0_26/bin/java
>> 0x0000003bf5600000    136K    /lib64/ld-2.5.so
>> 0x0000003bf5a00000    1681K    /lib64/libc-2.5.so
>> 0x0000003bf5e00000    22K    /lib64/libdl-2.5.so
>> 0x0000003bf6200000    142K    /lib64/libpthread-2.5.so
>> 0x0000003bf6600000    600K    /lib64/libm-2.5.so
>> 0x0000003bf7200000    52K    /lib64/librt-2.5.so
>> 0x0000003bf8600000    111K    /lib64/libnsl-2.5.so
>> 0x0000003bfbe00000    90K    /lib64/libresolv-2.5.so
>> 0x00002aaaaaab6000    65K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libverify.so
>> 0x00002aaaaabc5000    229K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libjava.so
>> 0x00002aaaaad00000    52K    /lib64/libnss_files-2.5.so
>> 0x00002aaaaaf0b000    90K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libzip.so
>> 0x00002aaab2dcc000    38K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libmanagement.so
>> 0x00002aaab2ed3000    110K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libnet.so
>> 0x00002aaab3504000    23K    /lib64/libnss_dns-2.5.so
>> 0x00002aaab3709000    43K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libnio.so
>> 0x00002aaab3818000    744K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libawt.so
>> 0x00002aaab39e7000    33K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/headless/libmawt.so
>> 0x00002aaab3aed000    655K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libfontmanager.so
>> 0x00002aaabc2fc000    718K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libmlib_image.so
>> 0x00002aaabc4ab000    221K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/libjpeg.so
>> 0x00002ac6d0e04000    47K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/jli/libjli.so
>> 0x00002ac6d0f11000    13027K
>> /home/q/java/jdk1.6.0_26/jre/lib/amd64/server/libjvm.so
>>
>> [root at l-tw14 ~]# jmap -heap 677
>> Attaching to process ID 677, please wait...
>> Debugger attached successfully.
>> Server compiler detected.
>> JVM version is 20.1-b02
>>
>> using thread-local object allocation.
>> Parallel GC with 4 thread(s)
>>
>> Heap Configuration:
>>    MinHeapFreeRatio = 40
>>    MaxHeapFreeRatio = 70
>>    MaxHeapSize      = 4194304000 (4000.0MB)
>>    NewSize          = 268435456 (256.0MB)
>>    MaxNewSize       = 268435456 (256.0MB)
>>    OldSize          = 5439488 (5.1875MB)
>>    NewRatio         = 2
>>    SurvivorRatio    = 8
>>    PermSize         = 268435456 (256.0MB)
>>    MaxPermSize      = 268435456 (256.0MB)
>>
>> Heap Usage:
>> PS Young Generation
>> Eden Space:
>>    capacity = 240779264 (229.625MB)
>>    used     = 42971864 (40.981163024902344MB)
>>    free     = 197807400 (188.64383697509766MB)
>>    17.84699532929879% used
>> From Space:
>>    capacity = 1769472 (1.6875MB)
>>    used     = 1736704 (1.65625MB)
>>    free     = 32768 (0.03125MB)
>>    98.14814814814815% used
>> To Space:
>>    capacity = 13893632 (13.25MB)
>>    used     = 0 (0.0MB)
>>    free     = 13893632 (13.25MB)
>>    0.0% used
>> PS Old Generation
>>    capacity = 3925868544 (3744.0MB)
>>    used     = 3600153552 (3433.373977661133MB)
>>    free     = 325714992 (310.6260223388672MB)
>>    91.70336478795761% used
>> PS Perm Generation
>>    capacity = 268435456 (256.0MB)
>>    used     = 191475048 (182.6048355102539MB)
>>    free     = 76960408 (73.3951644897461MB)
>>    71.33001387119293% used
>>
>> it seems that the heap usage is not high enough to occupy swap
>> space, which is weird. And I also checked the NIO usage; it was quite trivial:
>> [root at l-tw14 ~]#java -classpath .:$JAVA_HOME/lib/sa-jdi.jar
>> DirectMemorySize `pgrep java`
>> Attaching to process ID 677, please wait...
>> Debugger attached successfully.
>> Server compiler detected.
>> JVM version is 20.1-b02
>> NIO direct memory:
>>  reserved size = 0.357348MB (374707 bytes)
>>  max size = 3968.000000MB (4160749568 bytes)
>>
>> This problem has bothered me for quite a few days, and I appreciate all your
>> help. Thanks in advance
>>
>> _______________________________________________
>> hotspot-gc-use mailing list
>> hotspot-gc-use at openjdk.java.net
>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120305/0cefee5e/attachment.html 

From the.6th.month at gmail.com  Mon Mar 12 05:49:10 2012
From: the.6th.month at gmail.com (the.6th.month at gmail.com)
Date: Mon, 12 Mar 2012 20:49:10 +0800
Subject: occasionally rss usage surge
Message-ID: <CAKzy53kVHPOuuwnkW1hj=Riob90Q-PBC-k-q3Hp_dFj1z4StHA@mail.gmail.com>

Hi,All:

I've recently encountered a memory usage surge of the JVM; the JVM version is
JDK 6, 6u26. What we see from the top command is as below:
top - 20:12:17 up 112 days,  8:26,  1 user,  load average: 0.93, 1.13, 1.16
Tasks:  95 total,   2 running,  93 sleeping,   0 stopped,   0 zombie
Cpu(s): 11.1%us,  2.1%sy,  0.0%ni, 84.9%id,  0.7%wa,  0.0%hi,  0.4%si,
0.8%st
Mem:   6291456k total,  6277716k used,    13740k free,    10844k buffers
Swap:  4192924k total,   392692k used,  3800232k free,   336692k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
COMMAND

 5349 root      19   0 6246m 5.3g 5320 S 53.6 88.2 933:38.43
java

 8627 flume     19   0 1960m  55m 2368 S  0.0  0.9   3:26.98
java

 8573 flume     18   0 1860m  17m 2052 S  0.0  0.3   0:01.97
java

 1521 ntp       15   0 23412 5044 3916 S  0.0  0.1   0:06.33
ntpd

 1719 root      34  19  254m 4772 1768 S  0.0  0.1   0:09.58
yum-updatesd

 2060 root      17   0 88872 3268 2556 S  0.0  0.1   0:00.00
sshd

 1482 root      15   0  154m 3228 1872 S  0.0  0.1   1:43.69
snmpd

 2062 yue.liu   15   0 88872 1720  980 R  0.0  0.0   0:00.03
sshd

 1589 postfix   15   0 54420 1688 1556 S  0.0  0.0   0:00.20
qmgr

 1580 root      16   0 54180 1544 1456 S  0.0  0.0   0:00.59
master

 2095 root      15   0 66104 1536 1196 S  0.0  0.0   0:00.01
bash

 2063 yue.liu   16   0 66104 1520 1184 S  0.0  0.0   0:00.00
bash


It's weird that the JVM takes up 5.3g of memory, while the JVM heap size is
constrained to roughly 4g as specified by JAVA_OPTS:
-Xms3800m -Xmx3800m -Xss256k -Xmn256m -XX:PermSize=256m -server

We also ran jmap to check the heap usage and didn't find any problem with
it:
Attaching to process ID 5349, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 20.1-b02

using thread-local object allocation.
Parallel GC with 4 thread(s)

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 3984588800 (3800.0MB)
   NewSize          = 268435456 (256.0MB)
   MaxNewSize       = 268435456 (256.0MB)
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 268435456 (256.0MB)
   MaxPermSize      = 268435456 (256.0MB)

Heap Usage:
PS Young Generation
Eden Space:
   capacity = 159514624 (152.125MB)
   used     = 113902600 (108.62598419189453MB)
   free     = 45612024 (43.49901580810547MB)
   71.40574145728482% used
>From Space:
   capacity = 26083328 (24.875MB)
   used     = 26082200 (24.873924255371094MB)
   free     = 1128 (0.00107574462890625MB)
   99.99567539847676% used
To Space:
   capacity = 54460416 (51.9375MB)
   used     = 0 (0.0MB)
   free     = 54460416 (51.9375MB)
   0.0% used
PS Old Generation
   capacity = 3716153344 (3544.0MB)
   used     = 1924262008 (1835.119255065918MB)
   free     = 1791891336 (1708.880744934082MB)
   51.78101735513312% used
PS Perm Generation
   capacity = 268435456 (256.0MB)
   used     = 106493600 (101.56021118164062MB)
   free     = 161941856 (154.43978881835938MB)
   39.67195749282837% used

Then I grepped smaps for RSS usage, and found some segments with high
memory residency:
grep Rss /proc/5349/smaps | awk '{if($(NF-1)>0) print $(NF-1)}' | sort -rn
| head -20
4091880
111036
65516
65508
65500
65492
65124
64276
64276
64012
62120
60664
59700
58076
57688
56144
55424
48584
42896
36512

the 4091880 kB and 111036 kB segments come from $JAVA_HOME/bin/java, *and the
other segments of roughly 60 megabytes each, which sum up to about 1g*, come from:
2aaabc000000-2aaabc008000 r-xs 00115000 ca:07 6488847
/home/q/java/jdk1.6.0_26/jre/lib/resources.jar
Size:                32 kB
Rss:                  0 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:        0 kB
Swap:                 0 kB
Pss:                  0 kB
2aaac0000000-2aaac4000000 rwxp 2aaac0000000 00:00 0
Size:             65536 kB
Rss:              51620 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    51620 kB
Swap:             12656 kB
Pss:              51620 kB
2aaac8000000-2aaacbff7000 rwxp 2aaac8000000 00:00 0
Size:             65500 kB
Rss:              65500 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    65500 kB
Swap:                 0 kB
Pss:              65500 kB
2aaacc000000-2aaacfff9000 rwxp 2aaacc000000 00:00 0
Size:             65508 kB
Rss:              64000 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    64000 kB
Swap:              1508 kB
Pss:              64000 kB
2aaad0000000-2aaad3ff5000 rwxp 2aaad0000000 00:00 0
Size:             65492 kB
Rss:              47784 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    47784 kB
Swap:             17708 kB
Pss:              47784 kB
2aaad4000000-2aaad7ffb000 rwxp 2aaad4000000 00:00 0
Size:             65516 kB
Rss:              65516 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    65516 kB
Swap:                 0 kB
Pss:              65516 kB
2aaad8000000-2aaadbe83000 rwxp 2aaad8000000 00:00 0
Size:             64012 kB
Rss:              64012 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    64012 kB
Swap:                 0 kB
Pss:              64012 kB
2aaae8000000-2aaaebff7000 rwxp 2aaae8000000 00:00 0
Size:             65500 kB
Rss:              65492 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    65492 kB
Swap:                 0 kB
Pss:              65492 kB
2aaaec000000-2aaaefffd000 rwxp 2aaaec000000 00:00 0
Size:             65524 kB
Rss:              59700 kB
Shared_Clean:         0 kB
Shared_Dirty:         0 kB
Private_Clean:        0 kB
Private_Dirty:    59700 kB
Swap:                 0 kB

*I am wondering whether this is because I set /proc/sys/vm/swappiness to
zero, or, if not, whether it is because of resources.jar*. Does anyone know what
this jar is used for, or have any idea about the cause of the memory usage
surge?

Any help would be highly appreciated, thanks very much.

All the best,
Leon
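
To help quantify the anonymous segments listed above, a rough sketch (Java 6
style, parsing the /proc/<pid>/smaps format shown in the mail) that sums Rss
and Swap over all anonymous read-write mappings:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class AnonMappingSummary {
        public static void main(String[] args) throws IOException {
            String pid = (args.length > 0) ? args[0] : "self";
            BufferedReader in = new BufferedReader(new FileReader("/proc/" + pid + "/smaps"));
            long rssKb = 0, swapKb = 0;
            boolean anon = false;
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] t = line.split("\\s+");
                    if (t.length >= 5 && t[0].indexOf('-') > 0 && t[1].length() == 4) {
                        // mapping header, e.g. "2aaac0000000-2aaac4000000 rwxp 2aaac0000000 00:00 0"
                        // no trailing path (5 fields) => anonymous mapping
                        anon = t[1].startsWith("rw") && t.length == 5;
                    } else if (anon && line.startsWith("Rss:")) {
                        rssKb += Long.parseLong(t[1]);
                    } else if (anon && line.startsWith("Swap:")) {
                        swapKb += Long.parseLong(t[1]);
                    }
                }
            } finally {
                in.close();
            }
            System.out.println("anonymous rw mappings: Rss=" + rssKb + " kB, Swap=" + swapKb + " kB");
        }
    }

Comparing that total against the configured heap and perm sizes shows how much
native (non-heap) memory the ~60 MB segments really account for.
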
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120312/e5abbd78/attachment.html 

From fancyerii at gmail.com  Thu Mar 15 00:19:18 2012
From: fancyerii at gmail.com (Li Li)
Date: Thu, 15 Mar 2012 15:19:18 +0800
Subject: about +UseCompressedOOPS in 64bit JVM
Message-ID: <CAFAd71W+K_chNs18_jFMwHWzDiDJqaUG8=aHfu8LpwCfY8csvA@mail.gmail.com>

   For a 64-bit JVM, if heap usage is less than 4GB, it should enable this
option: it can reduce memory usage and improve performance (better CPU cache
utilization). What if the JVM uses a large heap?
   http://lists.apple.com/archives/java-dev/2010/Apr/msg00157.html -- this
post says enabling it may slow an application down.
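
If in doubt whether the flag is actually in effect on a particular VM, a small
sketch like this can ask the HotSpotDiagnostic MXBean (it assumes a HotSpot JVM,
since it uses the com.sun.management API):

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class CheckCompressedOops {
        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            HotSpotDiagnosticMXBean hs = ManagementFactory.newPlatformMXBeanProxy(
                    server, "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            // prints e.g. "UseCompressedOops = true (origin: DEFAULT)"
            System.out.println("UseCompressedOops = "
                    + hs.getVMOption("UseCompressedOops").getValue()
                    + " (origin: " + hs.getVMOption("UseCompressedOops").getOrigin() + ")");
        }
    }

From outside the process, jinfo -flag UseCompressedOops <pid> should report the
same thing.
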
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120315/505a7528/attachment.html 

From jon.masamitsu at oracle.com  Thu Mar 15 08:00:28 2012
From: jon.masamitsu at oracle.com (Jon Masamitsu)
Date: Thu, 15 Mar 2012 08:00:28 -0700
Subject: about +UseCompressedOOPS in 64bit JVM
In-Reply-To: <CAFAd71W+K_chNs18_jFMwHWzDiDJqaUG8=aHfu8LpwCfY8csvA@mail.gmail.com>
References: <CAFAd71W+K_chNs18_jFMwHWzDiDJqaUG8=aHfu8LpwCfY8csvA@mail.gmail.com>
Message-ID: <4F62040C.8010207@oracle.com>



On 03/15/12 00:19, Li Li wrote:
>    For a 64-bit JVM, if heap usage is less than 4GB, it should enable
> this option: it can reduce memory usage and improve performance (better
> CPU cache utilization). What if the JVM uses a large heap?
> http://lists.apple.com/archives/java-dev/2010/Apr/msg00157.html -- this
> post says enabling it may slow an application down.

I think that the responses to that thread make the relevant points.   
Many applications
benefit from compressed oops but some don't.  In the end you just need 
to try it with
your application and see if it helps.

>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120315/bc118bdc/attachment.html 

From taras.tielkes at gmail.com  Tue Mar 20 09:36:29 2012
From: taras.tielkes at gmail.com (Taras Tielkes)
Date: Tue, 20 Mar 2012 17:36:29 +0100
Subject: Promotion failures: indication of CMS fragmentation?
In-Reply-To: <CA+R7V783r2459y7r6zxXNP9_eQ2KOa5Oh0Mr2tPhS1d-8H-ing@mail.gmail.com>
References: <CA+R7V78oeNvQwWOjagdANw=h0Ws_p5da7BDeOhguoKT1V5n5dQ@mail.gmail.com>
	<4EF9FCAC.3030208@oracle.com>
	<CA+R7V7-SGdXmbtqo=+2VQwKVnCVCZdj4M=gQfrxiGf2fEMi3cA@mail.gmail.com>
	<4F06A270.3010701@oracle.com>
	<CA+R7V78Twoz0a=J5oCRYJjBdnptPdUv9Jnvt4wiLUsh3Cy+bHw@mail.gmail.com>
	<4F0DBEC4.7040907@oracle.com>
	<CA+R7V7-pxrKH5L2brxZRZwKrv7ZF3aYtQkZmb7-A=nSLn5QfYg@mail.gmail.com>
	<4F1ECE7B.3040502@oracle.com>
	<CA+R7V79x29mXvkEKuPnCYrAJfZjzHc5QnfgrNCYPZFO8GRYayg@mail.gmail.com>
	<4F1F2ED7.6060308@oracle.com>
	<CA+R7V7_P4xdsOMdM+KgiO-urNMiPakQQcdjnOQ_yYo4KZhko2w@mail.gmail.com>
	<4F20F78D.9070905@oracle.com>
	<CA+R7V79M0B2UTqqxiUGfoK-1pMP54e+biBnH+wy=zGEA2vjihg@mail.gmail.com>
	<CA+R7V79F59SJL6F7QvmWAuCKyisv5MFuDvsBfkDuvU0UcZ_iOw@mail.gmail.com>
	<CA+R7V7_st6DPnJZOMUnAeRVeYND42Y19rAUjNJ+PhtF72Ur2mQ@mail.gmail.com>
	<CAG7eTFoeYitaBjgt2eUT3kXqU2SGk1eC5eofdAAL1SjuCFMHCg@mail.gmail.com>
	<CABzyjyna2Mq7EXDiZ8mVB=1MX9Gw1=e2z8zO8X69QeodVKbBrg@mail.gmail.com>
	<CA+R7V79NDPWq2YeX8NQ9j6XW8P7=dZbQWJwzG6dxz4UgUGvKuA@mail.gmail.com>
	<CA+R7V783r2459y7r6zxXNP9_eQ2KOa5Oh0Mr2tPhS1d-8H-ing@mail.gmail.com>
Message-ID: <CA+R7V78c=D70MJv-jUw2X4ayOOUV0UBZDphcucRSpiYYncoV1A@mail.gmail.com>

Hi,

I've collected -XX:+PrintTenuringDistribution data from a node in our
production environment, running -Xmx5g -Xmn400m -XX:SurvivorRatio=8.
On one other production node, we've configured a larger new gen, and
larger survivor spaces (-Xmx5g -Xmn800m -XX:SurvivorRatio=4).
This node has -XX:+PrintTenuringDistribution logging as well.

The node running the larger new gen and survivor spaces has not run
into a promotion failure yet, while the ones still running the old
config have hit a few.
The promotion failures are typically experienced at high load periods,
which makes sense, as allocation and promotion will experience a spike
in those periods as well.

The inherent nature of the application implies relatively long
sessions (towards a few hours), retaining a fair amount of state up to
an hour.
I believe this is the main reason of the relatively high promotion
rate we're experiencing.


Here's a fragment of gc log from one of the nodes running the older
(smaller) new gen, including a promotion failure:
-------------------------
2012-03-15T18:32:17.785+0100: 796604.225: [GC 796604.225: [ParNew
Desired survivor size 20971520 bytes, new threshold 8 (max 15)
- age   1:    2927728 bytes,    2927728 total
- age   2:    2428512 bytes,    5356240 total
- age   3:    2696376 bytes,    8052616 total
- age   4:    2623576 bytes,   10676192 total
- age   5:    3365576 bytes,   14041768 total
- age   6:    2792272 bytes,   16834040 total
- age   7:    2233008 bytes,   19067048 total
- age   8:    2263824 bytes,   21330872 total
: 358709K->29362K(368640K), 0.0461460 secs]
3479492K->3151874K(5201920K), 0.0467320 secs] [Times: user=0.34
sys=0.01, real=0.05 secs]
2012-03-15T18:32:21.546+0100: 796607.986: [GC 796607.986: [ParNew (0:
promotion failure size = 25)  (1: promotion failure size = 25)  (2:
promotion failure size = 25)  (3: promotion failure size = 25)  (4:
promotion failure size = 25)  (5
: promotion failure size = 25)  (6: promotion failure size = 341)  (7:
promotion failure size = 25)  (promotion failed)
Desired survivor size 20971520 bytes, new threshold 8 (max 15)
- age   1:    3708208 bytes,    3708208 total
- age   2:    2174384 bytes,    5882592 total
- age   3:    2383256 bytes,    8265848 total
- age   4:    2689912 bytes,   10955760 total
- age   5:    2621832 bytes,   13577592 total
- age   6:    3360440 bytes,   16938032 total
- age   7:    2784136 bytes,   19722168 total
- age   8:    2220232 bytes,   21942400 total
: 357042K->356456K(368640K), 0.2734100 secs]796608.259: [CMS:
3124189K->516640K(4833280K), 6.8127070 secs]
3479554K->516640K(5201920K), [CMS Perm : 142423K->142284K(262144K)],
7.0867850 secs] [Times: user=7.32 sys=0.07, real=7.09 secs]
2012-03-15T18:32:30.279+0100: 796616.719: [GC 796616.720: [ParNew
Desired survivor size 20971520 bytes, new threshold 1 (max 15)
- age   1:   29721456 bytes,   29721456 total
: 327680K->40960K(368640K), 0.0403130 secs]
844320K->557862K(5201920K), 0.0409070 secs] [Times: user=0.27
sys=0.01, real=0.04 secs]
2012-03-15T18:32:32.701+0100: 796619.141: [GC 796619.141: [ParNew
Desired survivor size 20971520 bytes, new threshold 15 (max 15)
- age   1:   10310176 bytes,   10310176 total
-------------------------

For contrast, here's a gc log fragment from the single node running
the larger new gen and larger survivor spaces:
(the fragment is from the same point in time, with the nodes
experiencing equal load)
-------------------------
2012-03-15T18:32:12.067+0100: 797119.336: [GC 797119.336: [ParNew
Desired survivor size 69894144 bytes, new threshold 15 (max 15)
- age   1:    5611536 bytes,    5611536 total
- age   2:    3731888 bytes,    9343424 total
- age   3:    3450672 bytes,   12794096 total
- age   4:    3314744 bytes,   16108840 total
- age   5:    3459888 bytes,   19568728 total
- age   6:    3334712 bytes,   22903440 total
- age   7:    3671960 bytes,   26575400 total
- age   8:    3841608 bytes,   30417008 total
- age   9:    2035392 bytes,   32452400 total
- age  10:    1975056 bytes,   34427456 total
- age  11:    2021344 bytes,   36448800 total
- age  12:    1520752 bytes,   37969552 total
- age  13:    1494176 bytes,   39463728 total
- age  14:    2355136 bytes,   41818864 total
- age  15:    1279000 bytes,   43097864 total
: 603473K->61640K(682688K), 0.0756570 secs]
3373284K->2832383K(5106368K), 0.0762090 secs] [Times: user=0.56
sys=0.00, real=0.08 secs]
2012-03-15T18:32:18.200+0100: 797125.468: [GC 797125.469: [ParNew
Desired survivor size 69894144 bytes, new threshold 15 (max 15)
- age   1:    6101320 bytes,    6101320 total
- age   2:    4446776 bytes,   10548096 total
- age   3:    3701384 bytes,   14249480 total
- age   4:    3438488 bytes,   17687968 total
- age   5:    3295360 bytes,   20983328 total
- age   6:    3403320 bytes,   24386648 total
- age   7:    3323368 bytes,   27710016 total
- age   8:    3665760 bytes,   31375776 total
- age   9:    2427904 bytes,   33803680 total
- age  10:    1418656 bytes,   35222336 total
- age  11:    1955192 bytes,   37177528 total
- age  12:    2006064 bytes,   39183592 total
- age  13:    1520768 bytes,   40704360 total
- age  14:    1493728 bytes,   42198088 total
- age  15:    2354376 bytes,   44552464 total
: 607816K->62650K(682688K), 0.0779270 secs]
3378559K->2834643K(5106368K), 0.0784690 secs] [Times: user=0.58
sys=0.00, real=0.08 secs]
-------------------------

Questions:

1) From the tenuring distributions, it seems that the application
benefits from larger new gen and survivor spaces.
The next thing we'll try is to run with -Xmn1g -XX:SurvivorRatio=2,
and see if the ParNew times are still acceptable.
Does this seem a sensible approach in this context?
Are there other variables beyond ParNew times that limit scaling the
new gen to a large size?

2) Given the object age demographics inherent to our application, we
can not expect to see the majority of data get collected in the new
gen.

Our approach to fight the promotion failures consists of three aspects:
a) Lower the overall allocation rate of our application (by improving
wasteful hotspots), to decrease overall ParNew collection frequency.
b) Configure the new gen and survivor spaces as large as possible,
keeping an eye on ParNew times and overall new/tenured ratio.
c) Try to refactor the data structures that form the bulk of promoted
data, to retain only the strictly required subgraphs.

Is there anything else I can try or measure, in order to better
understand the problem?

Thanks in advance,
Taras
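
One additional thing that is cheap to measure is the promotion volume per
scavenge, suggested earlier in the thread as a rough proxy for premature
promotion. A polling-based sketch (the pool-name check assumes the usual
ParNew+CMS naming, e.g. "CMS Old Gen"):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;

    public class PromotionWatcher {
        public static void main(String[] args) throws InterruptedException {
            MemoryPoolMXBean oldGen = null;
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                // "CMS Old Gen" with ParNew+CMS; adjust the match if your pools differ
                if (pool.getName().contains("Old Gen") || pool.getName().contains("Tenured")) {
                    oldGen = pool;
                }
            }
            if (oldGen == null) {
                System.err.println("no old generation pool found");
                return;
            }
            long last = oldGen.getUsage().getUsed();
            while (true) {
                Thread.sleep(1000);
                long now = oldGen.getUsage().getUsed();
                if (now > last) {
                    // growth between polls is (mostly) promotion; a CMS cycle shows up as a drop
                    System.out.println("old gen grew by " + ((now - last) >> 10) + " KB");
                }
                last = now;
            }
        }
    }

The same number can also be read straight off the gc log: in the second fragment
above, the young gen shrank by 603473K - 61640K = 541833K while the whole heap
shrank by 3373284K - 2832383K = 540901K, so roughly 932K (about 0.9 MB) was
promoted in that scavenge.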


On Wed, Feb 22, 2012 at 10:51 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
> (this time properly responding to the list alias)
> Hi Srinivas,
>
> We're running 1.6.0 u29 on Linux x64. My understanding is that
> CompressedOops is enabled by default since u23.
>
> At least this page seems to support that:
> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
>
> Regarding the other remarks (also from Todd and Chi), I'll comment
> later. The first thing on my list is to collect
> PrintTenuringDistribution data now.
>
> Kind regards,
> Taras
>
> On Wed, Feb 22, 2012 at 10:50 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
>> Hi Srinivas,
>>
>> We're running 1.6.0 u29 on Linux x64. My understanding is that
>> CompressedOops is enabled by default since u23.
>>
>> At least this page seems to support that:
>> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
>>
>> Regarding the other remarks (also from Todd and Chi), I'll comment
>> later. The first thing on my list is to collect
>> PrintTenuringDistribution data now.
>>
>> Kind regards,
>> Taras
>>
>> On Wed, Feb 22, 2012 at 12:40 AM, Srinivas Ramakrishna
>> <ysr1729 at gmail.com> wrote:
>>> I agree that premature promotions are almost always the first and most
>>> important thing to fix when running
>>> into fragmentation or overload issues with CMS. However, I can also imagine
>>> long-lived objects with a highly
>>> non-stationary size distribution which can also cause problems for CMS
>>> despite best efforts to tune against
>>> premature promotion.
>>>
>>> I didn't think Taras was running with MTT=0, although MTT > 0 is no recipe
>>> for avoiding premature promotion
>>> with bursty loads that can overflow the survivor spaces -- as you say, large
>>> survivor spaces with a low
>>> TargetSurvivorRatio -- so as to leave plenty of space to absorb/accommodate
>>> spiking/bursty loads -- is
>>> definitely a "best practice" for CMS (and possibly for other concurrent
>>> collectors as well).
>>>
>>> One thing Taras can do to see if premature promotion might be an issue is to
>>> look at the tenuring
>>> threshold in his case. A rough proxy (if PrintTenuringDistribution is not
>>> enabled) is to look at the
>>> promotion volume per scavenge. It may be possible, if premature promotion is
>>> a cause, to see
>>> some kind of medium-term correlation between high promotion volume and
>>> eventual promotion
>>> failure despite frequent CMS collections.
>>>
>>> One other point which may or may not be relevant. I see that Taras is not
>>> using CompressedOops...
>>> Using that alone would greatly decrease memory pressure and provide more
>>> breathing room to CMS,
>>> which is also almost always a good idea.
>>>
>>> -- ramki
>>>
>>> On Tue, Feb 21, 2012 at 10:16 AM, Chi Ho Kwok <chkwok at digibites.nl> wrote:
>>>>
>>>> Hi Taras,
>>>>
>>>> I think you may want to look into sizing the new and especially the
>>>> survivor spaces differently. We run something similar to what you described,
>>>> high volume request processing with large dataset loading, and what we've
>>>> seen at the start is that the survivor spaces are completely overloaded,
>>>> causing premature promotions.
>>>>
>>>> We've configured our vm with the following goals/guideline:
>>>>
>>>> old space is for semi-permanent data, living for at least 30s, average ~10
>>>> minutes
>>>> new space contains only temporary and just loaded data
>>>> surviving objects from new should never reach old in 1 gc, so the survivor
>>>> space may never be 100% full
>>>>
>>>> With jstat -gcutil `pidof java` 2000, we see things like:
>>>>
>>>>    S0     S1     E      O      P     YGC       YGCT   FGC     FGCT       GCT
>>>>  70.20   0.00  19.65  57.60  59.90  124808  29474.299  2498  191.110  29665.409
>>>>  70.20   0.00  92.89  57.60  59.90  124808  29474.299  2498  191.110  29665.409
>>>>  70.20   0.00  93.47  57.60  59.90  124808  29474.299  2498  191.110  29665.409
>>>>   0.00  65.69  78.07  58.09  59.90  124809  29474.526  2498  191.110  29665.636
>>>>  84.97   0.00  48.19  58.57  59.90  124810  29474.774  2498  191.110  29665.884
>>>>  84.97   0.00  81.30  58.57  59.90  124810  29474.774  2498  191.110  29665.884
>>>>   0.00  62.64  27.22  59.12  59.90  124811  29474.992  2498  191.110  29666.102
>>>>   0.00  62.64  54.47  59.12  59.90  124811  29474.992  2498  191.110  29666.102
>>>>  75.68   0.00   6.80  59.53  59.90  124812  29475.228  2498  191.110  29666.338
>>>>  75.68   0.00  23.38  59.53  59.90  124812  29475.228  2498  191.110  29666.338
>>>>  75.68   0.00  27.72  59.53  59.90  124812  29475.228  2498  191.110  29666.338
>>>>
>>>> If you follow the lines, you can see Eden fill up to 100% on line 4,
>>>> surviving objects are copied into S1, S0 is collected and added 0.49% to
>>>> Old. On line 5, another GC happened, with Eden->S0, S1->Old, etc. No objects
>>>> is ever transferred from Eden to Old, unless there's a huge peak of
>>>> requests.
>>>>
>>>> This is with a 32GB heap, Xmn1200M, SurvivorRatio 2 (600MB Eden, 300MB
>>>> S0, 300MB S1), MaxTenuringThreshold 1 (whatever is still alive in S0/1 on
>>>> the second GC is copied to old, don't wait, web requests are quite bursty).
>>>> With about 1 collection every 2-5 seconds, objects promoted to Old must live
>>>> for at least 4-10 seconds; as that's longer than an average request (50ms-1s),
>>>> none of the temporary data ever makes it into Old, which is much more
>>>> expensive to collect. It works even with a higher than default
>>>> CMSInitiatingOccupancyFraction=76 to optimize for space available for the
>>>> large data cache we have.
>>>>
>>>>
>>>> With your config of 400MB Total new, with 350MB Eden, 25MB S0, 25MB S1
>>>> (SurvivorRatio 8), no tenuring threshold, I think loads of new objects get
>>>> copied from Eden to Old directly, causing trouble for the CMS. You can use
>>>> jstat to get live stats and tweak until it doesn't happen. If you can't make
>>>> changes on live that easily, try doubling the new size indeed, with a 400
>>>> Eden, 200 S0, 200 S1 and MaxTenuringThreshold 1 setting. It's probably
>>>> overkill, but it should solve the problem if it is caused by premature
>>>> promotion.
>>>>
>>>>
>>>> Chi Ho Kwok
>>>>
>>>>
>>>> On Tue, Feb 21, 2012 at 5:55 PM, Taras Tielkes <taras.tielkes at gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We've removed the "-XX:+CMSScavengeBeforeRemark" setting from 50% of
>>>>> our production nodes.
>>>>> After running for a few weeks, it seems that there's no impact from
>>>>> removing this option.
>>>>> Which is good, since it seems we can remove it from the other nodes as
>>>>> well, simplifying our overall JVM configuration ;-)
>>>>>
>>>>> However, we're still seeing promotion failures on all nodes, once
>>>>> every day or so.
>>>>>
>>>>> There's still the "Magic 1026": this accounts for ~60% of the
>>>>> promotion failures that we're seeing (single ParNew thread thread,
>>>>> 1026 failure size):
>>>>> --------------------
>>>>> 2012-02-06T09:13:51.806+0100: 328095.085: [GC 328095.086: [ParNew:
>>>>> 359895K->29357K(368640K), 0.0429070 secs]
>>>>> 3471021K->3143476K(5201920K), 0.0434950 secs] [Times: user=0.32
>>>>> sys=0.00, real=0.04 secs]
>>>>> 2012-02-06T09:13:55.922+0100: 328099.201: [GC 328099.201: [ParNew:
>>>>> 357037K->31817K(368640K), 0.0429130 secs]
>>>>> 3471156K->3148946K(5201920K), 0.0434930 secs] [Times: user=0.31
>>>>> sys=0.00, real=0.04 secs]
>>>>> 2012-02-06T09:13:59.044+0100: 328102.324: [GC 328102.324: [ParNew
>>>>> (promotion failure size = 1026) ?(promotion failed):
>>>>> 359497K->368640K(368640K), 0.2226790 secs]328102.547: [CMS:
>>>>> 3125609K->451515K(4833280K), 5.6225880 secs] 3476626K->4515
>>>>> 15K(5201920K), [CMS Perm : 124373K->124353K(262144K)], 5.8459380 secs]
>>>>> [Times: user=6.20 sys=0.01, real=5.85 secs]
>>>>> 2012-02-06T09:14:05.243+0100: 328108.522: [GC 328108.523: [ParNew:
>>>>> 327680K->40960K(368640K), 0.0319160 secs] 779195K->497658K(5201920K),
>>>>> 0.0325360 secs] [Times: user=0.21 sys=0.01, real=0.03 secs]
>>>>> 2012-02-06T09:14:07.836+0100: 328111.116: [GC 328111.116: [ParNew:
>>>>> 368640K->32785K(368640K), 0.0744670 secs] 825338K->520234K(5201920K),
>>>>> 0.0750390 secs] [Times: user=0.40 sys=0.02, real=0.08 secs]
>>>>> --------------------
>>>>> Given the 1026 word size, I'm wondering if I should be hunting for an
>>>>> overuse of BufferedInputStream/BufferedOutoutStream, since both have
>>>>> 8192 as a default buffer size.
>>>>>
>>>>> The second group of promotion failures look like this (multiple ParNew
>>>>> threads, small failure sizes):
>>>>> --------------------
>>>>> 2012-02-06T09:50:15.773+0100: 328756.964: [GC 328756.964: [ParNew:
>>>>> 356116K->29934K(368640K), 0.0461100 secs]
>>>>> 3203863K->2880162K(5201920K), 0.0468870 secs] [Times: user=0.34
>>>>> sys=0.01, real=0.05 secs]
>>>>> 2012-02-06T09:50:19.153+0100: 328760.344: [GC 328760.344: [ParNew:
>>>>> 357614K->30359K(368640K), 0.0454680 secs]
>>>>> 3207842K->2882892K(5201920K), 0.0462280 secs] [Times: user=0.33
>>>>> sys=0.01, real=0.05 secs]
>>>>> 2012-02-06T09:50:22.658+0100: 328763.849: [GC 328763.849: [ParNew (1:
>>>>> promotion failure size = 25) ?(4: promotion failure size = 25) ?(6:
>>>>> promotion failure size = 25) ?(7: promotion failure size = 144)
>>>>> (promotion failed): 358039K->358358
>>>>> K(368640K), 0.2148680 secs]328764.064: [CMS:
>>>>> 2854709K->446750K(4833280K), 5.8368270 secs]
>>>>> 3210572K->446750K(5201920K), [CMS Perm : 124670K->124644K(262144K)],
>>>>> 6.0525230 secs] [Times: user=6.32 sys=0.00, real=6.05 secs]
>>>>> 2012-02-06T09:50:29.896+0100: 328771.086: [GC 328771.087: [ParNew:
>>>>> 327680K->22569K(368640K), 0.0227080 secs] 774430K->469319K(5201920K),
>>>>> 0.0235020 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
>>>>> 2012-02-06T09:50:31.076+0100: 328772.266: [GC 328772.267: [ParNew:
>>>>> 350249K->22264K(368640K), 0.0235480 secs] 796999K->469014K(5201920K),
>>>>> 0.0243000 secs] [Times: user=0.18 sys=0.01, real=0.02 secs]
>>>>> --------------------
>>>>>
>>>>> We're going to try to double the new size on a single node, to see the
>>>>> effects of that.
>>>>>
>>>>> Beyond this experiment, is there any additional data I can collect to
>>>>> better understand the nature of the promotion failures?
>>>>> Am I facing collecting free list statistics at this point?
>>>>>
>>>>> Thanks,
>>>>> Taras
>>>>
>>>>
>>>> _______________________________________________
>>>> hotspot-gc-use mailing list
>>>> hotspot-gc-use at openjdk.java.net
>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>
>>>
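
On the 1026-word failure size quoted above: the arithmetic is at least
consistent with the BufferedInputStream/BufferedOutputStream suspicion. A
back-of-envelope sketch, assuming an 8-byte heap word and a 16-byte byte[]
header (64-bit HotSpot with compressed oops):

    public class PromotionFailureSizeCheck {
        public static void main(String[] args) {
            int bufferBytes = 8192; // default BufferedInputStream/BufferedOutputStream buffer size
            int arrayHeader = 16;   // assumed: 64-bit HotSpot with compressed oops (12-byte header + 4-byte length)
            int wordSize = 8;       // the reported failure size is in heap words
            System.out.println("byte[" + bufferBytes + "] is about "
                    + (bufferBytes + arrayHeader) / wordSize + " heap words"); // prints 1026
        }
    }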

From chkwok at digibites.nl  Tue Mar 20 10:44:21 2012
From: chkwok at digibites.nl (Chi Ho Kwok)
Date: Tue, 20 Mar 2012 18:44:21 +0100
Subject: Promotion failures: indication of CMS fragmentation?
In-Reply-To: <CA+R7V78c=D70MJv-jUw2X4ayOOUV0UBZDphcucRSpiYYncoV1A@mail.gmail.com>
References: <CA+R7V78oeNvQwWOjagdANw=h0Ws_p5da7BDeOhguoKT1V5n5dQ@mail.gmail.com>
	<4EF9FCAC.3030208@oracle.com>
	<CA+R7V7-SGdXmbtqo=+2VQwKVnCVCZdj4M=gQfrxiGf2fEMi3cA@mail.gmail.com>
	<4F06A270.3010701@oracle.com>
	<CA+R7V78Twoz0a=J5oCRYJjBdnptPdUv9Jnvt4wiLUsh3Cy+bHw@mail.gmail.com>
	<4F0DBEC4.7040907@oracle.com>
	<CA+R7V7-pxrKH5L2brxZRZwKrv7ZF3aYtQkZmb7-A=nSLn5QfYg@mail.gmail.com>
	<4F1ECE7B.3040502@oracle.com>
	<CA+R7V79x29mXvkEKuPnCYrAJfZjzHc5QnfgrNCYPZFO8GRYayg@mail.gmail.com>
	<4F1F2ED7.6060308@oracle.com>
	<CA+R7V7_P4xdsOMdM+KgiO-urNMiPakQQcdjnOQ_yYo4KZhko2w@mail.gmail.com>
	<4F20F78D.9070905@oracle.com>
	<CA+R7V79M0B2UTqqxiUGfoK-1pMP54e+biBnH+wy=zGEA2vjihg@mail.gmail.com>
	<CA+R7V79F59SJL6F7QvmWAuCKyisv5MFuDvsBfkDuvU0UcZ_iOw@mail.gmail.com>
	<CA+R7V7_st6DPnJZOMUnAeRVeYND42Y19rAUjNJ+PhtF72Ur2mQ@mail.gmail.com>
	<CAG7eTFoeYitaBjgt2eUT3kXqU2SGk1eC5eofdAAL1SjuCFMHCg@mail.gmail.com>
	<CABzyjyna2Mq7EXDiZ8mVB=1MX9Gw1=e2z8zO8X69QeodVKbBrg@mail.gmail.com>
	<CA+R7V79NDPWq2YeX8NQ9j6XW8P7=dZbQWJwzG6dxz4UgUGvKuA@mail.gmail.com>
	<CA+R7V783r2459y7r6zxXNP9_eQ2KOa5Oh0Mr2tPhS1d-8H-ing@mail.gmail.com>
	<CA+R7V78c=D70MJv-jUw2X4ayOOUV0UBZDphcucRSpiYYncoV1A@mail.gmail.com>
Message-ID: <CAG7eTFqwUOJk-UWYSSsSP6tpupS3B9BB1ZpDbfqH3SG4m3RAkQ@mail.gmail.com>

Hi Taras,

If the new setting works, is there still a reason to tune it further?

Further, from the stats, I see that, contrary to what you say, most of
the objects *are* getting collected in the young gen and survivor space.
Sure, there is some state that has to be kept for hours, but that's what
the old gen is for. For eden + survivor, I can see that the ~68M survivor
space in the new vm setup is enough to let only a tiny fraction into the
old gen: from the 590MB collected, only ~1MB made it into the old gen, and
on the next collection, only the ~2MB "age 15" group will get promoted if
they survive, unless a huge spike pushes multiple age cohorts out early and
the threshold drops below 15.

It's a lot easier to promote only 1-2MB of data x times per minute than 2MB of
data 2x times per minute -- with twice as much young gen, the collector runs
half as often. You can decrease the survivor ratio a bit more for safety
margin (to 3-4? is it set to 5 now? my calc says ~10%), but not to 1, or you
won't have enough space left in the young gen.
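
To make the sizing arithmetic concrete, a small sketch of how the survivor
capacity and the "Desired survivor size" line relate to -Xmn and
-XX:SurvivorRatio (assuming the default -XX:TargetSurvivorRatio=50):

    public class SurvivorSizing {
        public static void main(String[] args) {
            long newGen = 800L * 1024 * 1024; // -Xmn800m
            int survivorRatio = 4;            // -XX:SurvivorRatio=4 => eden : one survivor = 4 : 1
            int targetSurvivorRatio = 50;     // default -XX:TargetSurvivorRatio

            long survivor = newGen / (survivorRatio + 2); // eden plus two survivor spaces
            long eden = newGen - 2 * survivor;
            long desired = survivor * targetSurvivorRatio / 100;

            System.out.println("eden     ~ " + (eden >> 20) + " MB");
            System.out.println("survivor ~ " + (survivor >> 20) + " MB each");
            System.out.println("desired survivor occupancy ~ " + desired + " bytes");
        }
    }

That works out to roughly 66 MB of desired survivor occupancy, close to the
69894144 bytes printed in the log above (the small difference is presumably
alignment of the actual survivor capacity).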

Oh, and if you want to take a look at the live situation, try jstat on a
shell. It's the best tool you have to visualize GC, unless you manage to
convince ops to let you connect a jvisualvm+visualgc to live servers.


Chi Ho

On Tue, Mar 20, 2012 at 5:36 PM, Taras Tielkes <taras.tielkes at gmail.com>wrote:

> Hi,
>
> I've collected -XX:+PrintTenuringDistribution data from a node in our
> production environment, running -Xmx5g -Xmn400m -XX:SurvivorRatio=8.
> On one other production node, we've configured a larger new gen, and
> larger survivor spaces (-Xmx5g -Xmn800m -XX:SurvivorRatio=4).
> This node has -XX:+PrintTenuringDistribution logging as well.
>
> The node running the larger new gen and survivor spaces has not run
> into a promotion failure yet, while the ones still running the old
> config have hit a few.
> The promotion failures are typically experienced at high load periods,
> which makes sense, as allocation and promotion will experience a spike
> in those periods as well.
>
> The inherent nature of the application implies relatively long
> sessions (towards a few hours), retaining a fair amount of state up to
> an hour.
> I believe this is the main reason of the relatively high promotion
> rate we're experiencing.
>
>
> Here's a fragment of gc log from one of the nodes running the older
> (smaller) new gen, including a promotion failure:
> -------------------------
> 2012-03-15T18:32:17.785+0100: 796604.225: [GC 796604.225: [ParNew
> Desired survivor size 20971520 bytes, new threshold 8 (max 15)
> - age   1:    2927728 bytes,    2927728 total
> - age   2:    2428512 bytes,    5356240 total
> - age   3:    2696376 bytes,    8052616 total
> - age   4:    2623576 bytes,   10676192 total
> - age   5:    3365576 bytes,   14041768 total
> - age   6:    2792272 bytes,   16834040 total
> - age   7:    2233008 bytes,   19067048 total
> - age   8:    2263824 bytes,   21330872 total
> : 358709K->29362K(368640K), 0.0461460 secs]
> 3479492K->3151874K(5201920K), 0.0467320 secs] [Times: user=0.34
> sys=0.01, real=0.05 secs]
> 2012-03-15T18:32:21.546+0100: 796607.986: [GC 796607.986: [ParNew (0:
> promotion failure size = 25)  (1: promotion failure size = 25)  (2:
> promotion failure size = 25)  (3: promotion failure size = 25)  (4:
> promotion failure size = 25)  (5
> : promotion failure size = 25)  (6: promotion failure size = 341)  (7:
> promotion failure size = 25)  (promotion failed)
> Desired survivor size 20971520 bytes, new threshold 8 (max 15)
> - age   1:    3708208 bytes,    3708208 total
> - age   2:    2174384 bytes,    5882592 total
> - age   3:    2383256 bytes,    8265848 total
> - age   4:    2689912 bytes,   10955760 total
> - age   5:    2621832 bytes,   13577592 total
> - age   6:    3360440 bytes,   16938032 total
> - age   7:    2784136 bytes,   19722168 total
> - age   8:    2220232 bytes,   21942400 total
> : 357042K->356456K(368640K), 0.2734100 secs]796608.259: [CMS:
> 3124189K->516640K(4833280K), 6.8127070 secs]
> 3479554K->516640K(5201920K), [CMS Perm : 142423K->142284K(262144K)],
> 7.0867850 secs] [Times: user=7.32 sys=0.07, real=7.09 secs]
> 2012-03-15T18:32:30.279+0100: 796616.719: [GC 796616.720: [ParNew
> Desired survivor size 20971520 bytes, new threshold 1 (max 15)
> - age   1:   29721456 bytes,   29721456 total
> : 327680K->40960K(368640K), 0.0403130 secs]
> 844320K->557862K(5201920K), 0.0409070 secs] [Times: user=0.27
> sys=0.01, real=0.04 secs]
> 2012-03-15T18:32:32.701+0100: 796619.141: [GC 796619.141: [ParNew
> Desired survivor size 20971520 bytes, new threshold 15 (max 15)
> - age   1:   10310176 bytes,   10310176 total
> -------------------------
>
> For contrast, here's a gc log fragment from the single node running
> the larger new gen and larger survivor spaces:
> (the fragment is from the same point in time, with the nodes
> experiencing equal load)
> -------------------------
> 2012-03-15T18:32:12.067+0100: 797119.336: [GC 797119.336: [ParNew
> Desired survivor size 69894144 bytes, new threshold 15 (max 15)
> - age   1:    5611536 bytes,    5611536 total
> - age   2:    3731888 bytes,    9343424 total
> - age   3:    3450672 bytes,   12794096 total
> - age   4:    3314744 bytes,   16108840 total
> - age   5:    3459888 bytes,   19568728 total
> - age   6:    3334712 bytes,   22903440 total
> - age   7:    3671960 bytes,   26575400 total
> - age   8:    3841608 bytes,   30417008 total
> - age   9:    2035392 bytes,   32452400 total
> - age  10:    1975056 bytes,   34427456 total
> - age  11:    2021344 bytes,   36448800 total
> - age  12:    1520752 bytes,   37969552 total
> - age  13:    1494176 bytes,   39463728 total
> - age  14:    2355136 bytes,   41818864 total
> - age  15:    1279000 bytes,   43097864 total
> : 603473K->61640K(682688K), 0.0756570 secs]
> 3373284K->2832383K(5106368K), 0.0762090 secs] [Times: user=0.56
> sys=0.00, real=0.08 secs]
> 2012-03-15T18:32:18.200+0100: 797125.468: [GC 797125.469: [ParNew
> Desired survivor size 69894144 bytes, new threshold 15 (max 15)
> - age   1:    6101320 bytes,    6101320 total
> - age   2:    4446776 bytes,   10548096 total
> - age   3:    3701384 bytes,   14249480 total
> - age   4:    3438488 bytes,   17687968 total
> - age   5:    3295360 bytes,   20983328 total
> - age   6:    3403320 bytes,   24386648 total
> - age   7:    3323368 bytes,   27710016 total
> - age   8:    3665760 bytes,   31375776 total
> - age   9:    2427904 bytes,   33803680 total
> - age  10:    1418656 bytes,   35222336 total
> - age  11:    1955192 bytes,   37177528 total
> - age  12:    2006064 bytes,   39183592 total
> - age  13:    1520768 bytes,   40704360 total
> - age  14:    1493728 bytes,   42198088 total
> - age  15:    2354376 bytes,   44552464 total
> : 607816K->62650K(682688K), 0.0779270 secs]
> 3378559K->2834643K(5106368K), 0.0784690 secs] [Times: user=0.58
> sys=0.00, real=0.08 secs]
> -------------------------
>
> Questions:
>
> 1) From the tenuring distributions, it seems that the application
> benefits from larger new gen and survivor spaces.
> The next thing we'll try is to run with -Xmn1g -XX:SurvivorRatio=2,
> and see if the ParNew times are still acceptable.
> Does this seem a sensible approach in this context?
> Are there other variables beyond ParNew times that limit scaling the
> new gen to a large size?
>
> 2) Given the object age demographics inherent to our application, we
> can not expect to see the majority of data get collected in the new
> gen.
>
> Our approach to fight the promotion failures consists of three aspects:
> a) Lower the overall allocation rate of our application (by improving
> wasteful hotspots), to decrease overall ParNew collection frequency.
> b) Configure the new gen and survivor spaces as large as possible,
> keeping an eye on ParNew times and overall new/tenured ratio.
> c) Try to refactor the data structures that form the bulk of promoted
> data, to retain only the strictly required subgraphs.
>
> Is there anything else I can try or measure, in order to better
> understand the problem?
>
> Thanks in advance,
> Taras
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120320/e98ea510/attachment-0001.html 

From ysr1729 at gmail.com  Tue Mar 20 14:12:23 2012
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Tue, 20 Mar 2012 14:12:23 -0700
Subject: Promotion failures: indication of CMS fragmentation?
In-Reply-To: <CA+R7V78c=D70MJv-jUw2X4ayOOUV0UBZDphcucRSpiYYncoV1A@mail.gmail.com>
References: <CA+R7V78oeNvQwWOjagdANw=h0Ws_p5da7BDeOhguoKT1V5n5dQ@mail.gmail.com>
	<4EF9FCAC.3030208@oracle.com>
	<CA+R7V7-SGdXmbtqo=+2VQwKVnCVCZdj4M=gQfrxiGf2fEMi3cA@mail.gmail.com>
	<4F06A270.3010701@oracle.com>
	<CA+R7V78Twoz0a=J5oCRYJjBdnptPdUv9Jnvt4wiLUsh3Cy+bHw@mail.gmail.com>
	<4F0DBEC4.7040907@oracle.com>
	<CA+R7V7-pxrKH5L2brxZRZwKrv7ZF3aYtQkZmb7-A=nSLn5QfYg@mail.gmail.com>
	<4F1ECE7B.3040502@oracle.com>
	<CA+R7V79x29mXvkEKuPnCYrAJfZjzHc5QnfgrNCYPZFO8GRYayg@mail.gmail.com>
	<4F1F2ED7.6060308@oracle.com>
	<CA+R7V7_P4xdsOMdM+KgiO-urNMiPakQQcdjnOQ_yYo4KZhko2w@mail.gmail.com>
	<4F20F78D.9070905@oracle.com>
	<CA+R7V79M0B2UTqqxiUGfoK-1pMP54e+biBnH+wy=zGEA2vjihg@mail.gmail.com>
	<CA+R7V79F59SJL6F7QvmWAuCKyisv5MFuDvsBfkDuvU0UcZ_iOw@mail.gmail.com>
	<CA+R7V7_st6DPnJZOMUnAeRVeYND42Y19rAUjNJ+PhtF72Ur2mQ@mail.gmail.com>
	<CAG7eTFoeYitaBjgt2eUT3kXqU2SGk1eC5eofdAAL1SjuCFMHCg@mail.gmail.com>
	<CABzyjyna2Mq7EXDiZ8mVB=1MX9Gw1=e2z8zO8X69QeodVKbBrg@mail.gmail.com>
	<CA+R7V79NDPWq2YeX8NQ9j6XW8P7=dZbQWJwzG6dxz4UgUGvKuA@mail.gmail.com>
	<CA+R7V783r2459y7r6zxXNP9_eQ2KOa5Oh0Mr2tPhS1d-8H-ing@mail.gmail.com>
	<CA+R7V78c=D70MJv-jUw2X4ayOOUV0UBZDphcucRSpiYYncoV1A@mail.gmail.com>
Message-ID: <CABzyjynVHStdeACv_o3zyBhc_P_5yrMrqLb+++aWn8cft51PuA@mail.gmail.com>

As Chi Ho noted, about 3-4 MB of data does get promoted per scavenge, after
having sloshed around in your survivor spaces some 15 times. I'd venture that
whatever winnowing of young objects was to occur has in fact occurred already
within the first 3-4 scavenges that an object has survived, after which the
drop-off in population is less sharp. So I'd suggest lowering the MTT to about
3, while leaving the survivor ratio intact. That should reduce your copying
costs and bring down your scavenge pauses further, while not adversely
affecting your promotion rates (and concomitantly the fragmentation).

One thing that was a bit puzzling about the stats below: you'd expect the
volume of age X in scavenge N to be no less than the volume of age X+1 in
scavenge N+1, but occasionally that natural invariant does not appear to hold
-- indicating perhaps that either ages or populations are not being tracked
correctly.

I don't know if anyone else has noticed that in their tenuring
distributions as well....

-- ramki
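
A quick way to check for that is to run the PrintTenuringDistribution blocks
through a small filter. A rough sketch (reading a gc log on stdin; it assumes
the log comes from a single JVM, that consecutive "Desired survivor size"
blocks are consecutive scavenges, and that the age lines use the
"- age N: B bytes, T total" format shown above):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    public class TenuringInvariantCheck {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            Map<Integer, Long> prev = null;
            Map<Integer, Long> cur = null;
            int scavenge = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.contains("Desired survivor size")) {
                    // a new tenuring distribution block starts; check the previous pair first
                    check(prev, cur, scavenge);
                    prev = cur;
                    cur = new HashMap<Integer, Long>();
                    scavenge++;
                } else if (cur != null && line.trim().startsWith("- age")) {
                    // e.g. "- age   3:    2696376 bytes,    8052616 total"
                    String[] t = line.trim().split("\\s+");
                    int age = Integer.parseInt(t[2].replace(":", ""));
                    long bytes = Long.parseLong(t[3]);
                    cur.put(age, bytes);
                }
            }
            check(prev, cur, scavenge);
        }

        // the cohort at age a in this scavenge came from age a-1 in the previous
        // scavenge, so its volume should never grow
        static void check(Map<Integer, Long> prev, Map<Integer, Long> cur, int scavenge) {
            if (prev == null || cur == null) return;
            for (Map.Entry<Integer, Long> e : cur.entrySet()) {
                int age = e.getKey();
                Long before = prev.get(age - 1);
                if (age > 1 && before != null && e.getValue() > before) {
                    System.out.println("scavenge " + scavenge + ": age " + age + " has "
                            + e.getValue() + " bytes, but age " + (age - 1)
                            + " had only " + before + " bytes in the previous scavenge");
                }
            }
        }
    }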

On Tue, Mar 20, 2012 at 9:36 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
> Hi,
>
> I've collected -XX:+PrintTenuringDistribution data from a node in our
> production environment, running -Xmx5g -Xmn400m -XX:SurvivorRatio=8.
> On one other production node, we've configured a larger new gen, and
> larger survivor spaces (-Xmx5g -Xmn800m -XX:SurvivorRatio=4).
> This node has -XX:+PrintTenuringDistribution logging as well.
>
> The node running the larger new gen and survivor spaces has not run
> into a promotion failure yet, while the ones still running the old
> config have hit a few.
> The promotion failures are typically experienced at high load periods,
> which makes sense, as allocation and promotion will experience a spike
> in those periods as well.
>
> The inherent nature of the application implies relatively long
> sessions (towards a few hours), retaining a fair amount of state up to
> an hour.
> I believe this is the main reason of the relatively high promotion
> rate we're experiencing.
>
>
> Here's a fragment of gc log from one of the nodes running the older
> (smaller) new gen, including a promotion failure:
> -------------------------
> 2012-03-15T18:32:17.785+0100: 796604.225: [GC 796604.225: [ParNew
> Desired survivor size 20971520 bytes, new threshold 8 (max 15)
> - age   1:    2927728 bytes,    2927728 total
> - age   2:    2428512 bytes,    5356240 total
> - age   3:    2696376 bytes,    8052616 total
> - age   4:    2623576 bytes,   10676192 total
> - age   5:    3365576 bytes,   14041768 total
> - age   6:    2792272 bytes,   16834040 total
> - age   7:    2233008 bytes,   19067048 total
> - age   8:    2263824 bytes,   21330872 total
> : 358709K->29362K(368640K), 0.0461460 secs]
> 3479492K->3151874K(5201920K), 0.0467320 secs] [Times: user=0.34
> sys=0.01, real=0.05 secs]
> 2012-03-15T18:32:21.546+0100: 796607.986: [GC 796607.986: [ParNew (0:
> promotion failure size = 25)  (1: promotion failure size = 25)  (2:
> promotion failure size = 25)  (3: promotion failure size = 25)  (4:
> promotion failure size = 25)  (5
> : promotion failure size = 25)  (6: promotion failure size = 341)  (7:
> promotion failure size = 25)  (promotion failed)
> Desired survivor size 20971520 bytes, new threshold 8 (max 15)
> - age   1:    3708208 bytes,    3708208 total
> - age   2:    2174384 bytes,    5882592 total
> - age   3:    2383256 bytes,    8265848 total
> - age   4:    2689912 bytes,   10955760 total
> - age   5:    2621832 bytes,   13577592 total
> - age   6:    3360440 bytes,   16938032 total
> - age   7:    2784136 bytes,   19722168 total
> - age   8:    2220232 bytes,   21942400 total
> : 357042K->356456K(368640K), 0.2734100 secs]796608.259: [CMS:
> 3124189K->516640K(4833280K), 6.8127070 secs]
> 3479554K->516640K(5201920K), [CMS Perm : 142423K->142284K(262144K)],
> 7.0867850 secs] [Times: user=7.32 sys=0.07, real=7.09 secs]
> 2012-03-15T18:32:30.279+0100: 796616.719: [GC 796616.720: [ParNew
> Desired survivor size 20971520 bytes, new threshold 1 (max 15)
> - age   1:   29721456 bytes,   29721456 total
> : 327680K->40960K(368640K), 0.0403130 secs]
> 844320K->557862K(5201920K), 0.0409070 secs] [Times: user=0.27
> sys=0.01, real=0.04 secs]
> 2012-03-15T18:32:32.701+0100: 796619.141: [GC 796619.141: [ParNew
> Desired survivor size 20971520 bytes, new threshold 15 (max 15)
> - age   1:   10310176 bytes,   10310176 total
> -------------------------
>
> For contrast, here's a gc log fragment from the single node running
> the larger new gen and larger survivor spaces:
> (the fragment is from the same point in time, with the nodes
> experiencing equal load)
> -------------------------
> 2012-03-15T18:32:12.067+0100: 797119.336: [GC 797119.336: [ParNew
> Desired survivor size 69894144 bytes, new threshold 15 (max 15)
> - age   1:    5611536 bytes,    5611536 total
> - age   2:    3731888 bytes,    9343424 total
> - age   3:    3450672 bytes,   12794096 total
> - age   4:    3314744 bytes,   16108840 total
> - age   5:    3459888 bytes,   19568728 total
> - age   6:    3334712 bytes,   22903440 total
> - age   7:    3671960 bytes,   26575400 total
> - age   8:    3841608 bytes,   30417008 total
> - age   9:    2035392 bytes,   32452400 total
> - age  10:    1975056 bytes,   34427456 total
> - age  11:    2021344 bytes,   36448800 total
> - age  12:    1520752 bytes,   37969552 total
> - age  13:    1494176 bytes,   39463728 total
> - age  14:    2355136 bytes,   41818864 total
> - age  15:    1279000 bytes,   43097864 total
> : 603473K->61640K(682688K), 0.0756570 secs]
> 3373284K->2832383K(5106368K), 0.0762090 secs] [Times: user=0.56
> sys=0.00, real=0.08 secs]
> 2012-03-15T18:32:18.200+0100: 797125.468: [GC 797125.469: [ParNew
> Desired survivor size 69894144 bytes, new threshold 15 (max 15)
> - age   1:    6101320 bytes,    6101320 total
> - age   2:    4446776 bytes,   10548096 total
> - age   3:    3701384 bytes,   14249480 total
> - age   4:    3438488 bytes,   17687968 total
> - age   5:    3295360 bytes,   20983328 total
> - age   6:    3403320 bytes,   24386648 total
> - age   7:    3323368 bytes,   27710016 total
> - age   8:    3665760 bytes,   31375776 total
> - age   9:    2427904 bytes,   33803680 total
> - age  10:    1418656 bytes,   35222336 total
> - age  11:    1955192 bytes,   37177528 total
> - age  12:    2006064 bytes,   39183592 total
> - age  13:    1520768 bytes,   40704360 total
> - age  14:    1493728 bytes,   42198088 total
> - age  15:    2354376 bytes,   44552464 total
> : 607816K->62650K(682688K), 0.0779270 secs]
> 3378559K->2834643K(5106368K), 0.0784690 secs] [Times: user=0.58
> sys=0.00, real=0.08 secs]
> -------------------------
>
> Questions:
>
> 1) From the tenuring distributions, it seems that the application
> benefits from larger new gen and survivor spaces.
> The next thing we'll try is to run with -Xmn1g -XX:SurvivorRatio=2,
> and see if the ParNew times are still acceptable.
> Does this seem a sensible approach in this context?
> Are there other variables beyond ParNew times that limit scaling the
> new gen to a large size?
>
> 2) Given the object age demographics inherent to our application, we
> can not expect to see the majority of data get collected in the new
> gen.
>
> Our approach to fight the promotion failures consists of three aspects:
> a) Lower the overall allocation rate of our application (by improving
> wasteful hotspots), to decrease overall ParNew collection frequency.
> b) Configure the new gen and survivor spaces as large as possible,
> keeping an eye on ParNew times and overall new/tenured ratio.
> c) Try to refactor the data structures that form the bulk of promoted
> data, to retain only the strictly required subgraphs.
>
> Is there anything else I can try or measure, in order to better
> understand the problem?
>
> Thanks in advance,
> Taras
>
>
> On Wed, Feb 22, 2012 at 10:51 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
>> (this time properly responding to the list alias)
>> Hi Srinivas,
>>
>> We're running 1.6.0 u29 on Linux x64. My understanding is that
>> CompressedOops is enabled by default since u23.
>>
>> At least this page seems to support that:
>> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
>>
>> Regarding the other remarks (also from Todd and Chi), I'll comment
>> later. The first thing on my list is to collect
>> PrintTenuringDistribution data now.
>>
>> Kind regards,
>> Taras
>>
>> On Wed, Feb 22, 2012 at 10:50 AM, Taras Tielkes <taras.tielkes at gmail.com> wrote:
>>> Hi Srinivas,
>>>
>>> We're running 1.6.0 u29 on Linux x64. My understanding is that
>>> CompressedOops is enabled by default since u23.
>>>
>>> At least this page seems to support that:
>>> http://docs.oracle.com/javase/7/docs/technotes/guides/vm/performance-enhancements-7.html
>>>
>>> Regarding the other remarks (also from Todd and Chi), I'll comment
>>> later. The first thing on my list is to collect
>>> PrintTenuringDistribution data now.
>>>
>>> Kind regards,
>>> Taras
>>>
>>> On Wed, Feb 22, 2012 at 12:40 AM, Srinivas Ramakrishna
>>> <ysr1729 at gmail.com> wrote:
>>>> I agree that premature promotions are almost always the first and most
>>>> important thing to fix when running
>>>> into fragmentation or overload issues with CMS. However, I can also imagine
>>>> long-lived objects with a highly
>>>> non-stationary size distribution which can also cause problems for CMS
>>>> despite best efforts to tune against
>>>> premature promotion.
>>>>
>>>> I didn't think Taras was running with MTT=0, although MTT > 0 is no recipe
>>>> for avoiding premature promotion with bursty loads that can overflow the
>>>> survivor spaces -- as you say, large survivor spaces with a low
>>>> TargetSurvivorRatio, so as to leave plenty of space to absorb/accommodate
>>>> spiking/bursty loads, are definitely a "best practice" for CMS (and
>>>> possibly for other concurrent collectors as well).
>>>>
>>>> One thing Taras can do to see if premature promotion might be an issue is to
>>>> look at the tenuring
>>>> threshold in his case. A rough proxy (if PrintTenuringDistribution is not
>>>> enabled) is to look at the
>>>> promotion volume per scavenge. It may be possible, if premature promotion is
>>>> a cause, to see
>>>> some kind of medium-term correlation between high promotion volume and
>>>> eventual promotion
>>>> failure despite frequent CMS collections.
>>>>
>>>> One other point which may or may not be relevant. I see that Taras is not
>>>> using CompressedOops...
>>>> Using that alone would greatly decrease memory pressure and provide more
>>>> breathing room to CMS,
>>>> which is also almost always a good idea.
>>>>
>>>> -- ramki
>>>>
>>>> On Tue, Feb 21, 2012 at 10:16 AM, Chi Ho Kwok <chkwok at digibites.nl> wrote:
>>>>>
>>>>> Hi Taras,
>>>>>
>>>>> I think you may want to look into sizing the new and especially the
>>>>> survivor spaces differently. We run something similar to what you described,
>>>>> high volume request processing with large dataset loading, and what we've
>>>>> seen at the start is that the survivor spaces are completely overloaded,
>>>>> causing premature promotions.
>>>>>
>>>>> We've configured our vm with the following goals/guideline:
>>>>>
>>>>> old space is for semi-permanent data, living for at least 30s, average ~10
>>>>> minutes
>>>>> new space contains only temporary and just loaded data
>>>>> surviving objects from new should never reach old in 1 gc, so the survivor
>>>>> space may never be 100% full
>>>>>
>>>>> With jstat -gcutil `pidof java` 2000, we see things like:
>>>>>
>>>>>   S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>>>>>  70.20   0.00  19.65  57.60  59.90 124808 29474.299  2498  191.110
>>>>> 29665.409
>>>>>  70.20   0.00  92.89  57.60  59.90 124808 29474.299  2498  191.110
>>>>> 29665.409
>>>>>  70.20   0.00  93.47  57.60  59.90 124808 29474.299  2498  191.110
>>>>> 29665.409
>>>>>   0.00  65.69  78.07  58.09  59.90 124809 29474.526  2498  191.110
>>>>> 29665.636
>>>>>  84.97   0.00  48.19  58.57  59.90 124810 29474.774  2498  191.110
>>>>> 29665.884
>>>>>  84.97   0.00  81.30  58.57  59.90 124810 29474.774  2498  191.110
>>>>> 29665.884
>>>>>   0.00  62.64  27.22  59.12  59.90 124811 29474.992  2498  191.110
>>>>> 29666.102
>>>>>   0.00  62.64  54.47  59.12  59.90 124811 29474.992  2498  191.110
>>>>> 29666.102
>>>>>  75.68   0.00   6.80  59.53  59.90 124812 29475.228  2498  191.110
>>>>> 29666.338
>>>>>  75.68   0.00  23.38  59.53  59.90 124812 29475.228  2498  191.110
>>>>> 29666.338
>>>>>  75.68   0.00  27.72  59.53  59.90 124812 29475.228  2498  191.110
>>>>> 29666.338
>>>>>
>>>>> If you follow the lines, you can see Eden fill up to 100% on line 4,
>>>>> surviving objects are copied into S1, S0 is collected and added 0.49% to
>>>>> Old. On line 5, another GC happened, with Eden->S0, S1->Old, etc. No object
>>>>> is ever transferred from Eden to Old, unless there's a huge peak of
>>>>> requests.
>>>>>
>>>>> This is with a 32GB heap, -Xmn1200M, SurvivorRatio 2 (600MB Eden, 300MB
>>>>> S0, 300MB S1), MaxTenuringThreshold 1 (whatever is still alive in S0/1 on
>>>>> the second GC is copied to old, don't wait, web requests are quite bursty).
>>>>> With about 1 collection every 2-5 seconds, objects promoted to Old must live
>>>>> for at least 4-10 seconds; as that's longer than an average request (50ms-1s),
>>>>> none of the temporary data ever makes it into Old, which is much more
>>>>> expensive to collect. It works even with a higher than default
>>>>> CMSInitiatingOccupancyFraction=76 to optimize for space available for the
>>>>> large data cache we have.
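
Side note: pulled together, the setup described here corresponds roughly to a
command line like the one below. This is an assumed reconstruction from the
prose above, not the actual flags used:

    java -Xms32g -Xmx32g -Xmn1200m \
         -XX:SurvivorRatio=2 -XX:MaxTenuringThreshold=1 \
         -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
         -XX:CMSInitiatingOccupancyFraction=76 ...
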
>>>>>
>>>>>
>>>>> With your config of 400MB Total new, with 350MB Eden, 25MB S0, 25MB S1
>>>>> (SurvivorRatio 8), no tenuring threshold, I think loads of new objects get
>>>>> copied from Eden to Old directly, causing trouble for the CMS. You can use
>>>>> jstat to get live stats and tweak until it doesn't happen. If you can't make
>>>>> changes on live that easily, try doubling the new size indeed, with a 400
>>>>> Eden, 200 S0, 200 S1 and MaxTenuringThreshold 1 setting. It's probably
>>>>> overkill, but it should solve the problem if it is caused by premature
>>>>> promotion.
>>>>>
>>>>>
>>>>> Chi Ho Kwok
>>>>>
>>>>>
>>>>> On Tue, Feb 21, 2012 at 5:55 PM, Taras Tielkes <taras.tielkes at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We've removed the "-XX:+CMSScavengeBeforeRemark" setting from 50% of
>>>>>> our production nodes.
>>>>>> After running for a few weeks, it seems that there's no impact from
>>>>>> removing this option.
>>>>>> Which is good, since it seems we can remove it from the other nodes as
>>>>>> well, simplifying our overall JVM configuration ;-)
>>>>>>
>>>>>> However, we're still seeing promotion failures on all nodes, once
>>>>>> every day or so.
>>>>>>
>>>>>> There's still the "Magic 1026": this accounts for ~60% of the
>>>>>> promotion failures that we're seeing (single ParNew thread,
>>>>>> 1026 failure size):
>>>>>> --------------------
>>>>>> 2012-02-06T09:13:51.806+0100: 328095.085: [GC 328095.086: [ParNew:
>>>>>> 359895K->29357K(368640K), 0.0429070 secs]
>>>>>> 3471021K->3143476K(5201920K), 0.0434950 secs] [Times: user=0.32
>>>>>> sys=0.00, real=0.04 secs]
>>>>>> 2012-02-06T09:13:55.922+0100: 328099.201: [GC 328099.201: [ParNew:
>>>>>> 357037K->31817K(368640K), 0.0429130 secs]
>>>>>> 3471156K->3148946K(5201920K), 0.0434930 secs] [Times: user=0.31
>>>>>> sys=0.00, real=0.04 secs]
>>>>>> 2012-02-06T09:13:59.044+0100: 328102.324: [GC 328102.324: [ParNew
>>>>>> (promotion failure size = 1026)  (promotion failed):
>>>>>> 359497K->368640K(368640K), 0.2226790 secs]328102.547: [CMS:
>>>>>> 3125609K->451515K(4833280K), 5.6225880 secs]
>>>>>> 3476626K->451515K(5201920K), [CMS Perm : 124373K->124353K(262144K)], 5.8459380 secs]
>>>>>> [Times: user=6.20 sys=0.01, real=5.85 secs]
>>>>>> 2012-02-06T09:14:05.243+0100: 328108.522: [GC 328108.523: [ParNew:
>>>>>> 327680K->40960K(368640K), 0.0319160 secs] 779195K->497658K(5201920K),
>>>>>> 0.0325360 secs] [Times: user=0.21 sys=0.01, real=0.03 secs]
>>>>>> 2012-02-06T09:14:07.836+0100: 328111.116: [GC 328111.116: [ParNew:
>>>>>> 368640K->32785K(368640K), 0.0744670 secs] 825338K->520234K(5201920K),
>>>>>> 0.0750390 secs] [Times: user=0.40 sys=0.02, real=0.08 secs]
>>>>>> --------------------
>>>>>> Given the 1026 word size, I'm wondering if I should be hunting for an
>>>>>> overuse of BufferedInputStream/BufferedOutputStream, since both have
>>>>>> 8192 as a default buffer size.
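
That hunch is easy to sanity-check with arithmetic: a byte[8192] plus the
usual 16-byte array header on a 64-bit HotSpot VM is 8208 bytes, which is
exactly 1026 eight-byte words. A small sketch of that calculation (the header
size is an assumption about object layout, not something read from the logs):

    import java.io.BufferedInputStream;
    import java.io.ByteArrayInputStream;

    public class Magic1026 {
        public static void main(String[] args) throws Exception {
            // java.io.BufferedInputStream allocates a byte[8192] buffer by default.
            BufferedInputStream in =
                    new BufferedInputStream(new ByteArrayInputStream(new byte[0]));
            in.close();

            long bufferBytes = 8192; // default buffer size of the buffered streams
            long headerBytes = 16;   // assumed byte[] object header on 64-bit HotSpot
            long heapWords   = (bufferBytes + headerBytes) / 8;
            System.out.println(heapWords + " words"); // prints "1026 words"
        }
    }

If that matches, a jmap -histo on a live node should show whether byte[]
instances of that size dominate the allocations.
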
>>>>>>
>>>>>> The second group of promotion failures look like this (multiple ParNew
>>>>>> threads, small failure sizes):
>>>>>> --------------------
>>>>>> 2012-02-06T09:50:15.773+0100: 328756.964: [GC 328756.964: [ParNew:
>>>>>> 356116K->29934K(368640K), 0.0461100 secs]
>>>>>> 3203863K->2880162K(5201920K), 0.0468870 secs] [Times: user=0.34
>>>>>> sys=0.01, real=0.05 secs]
>>>>>> 2012-02-06T09:50:19.153+0100: 328760.344: [GC 328760.344: [ParNew:
>>>>>> 357614K->30359K(368640K), 0.0454680 secs]
>>>>>> 3207842K->2882892K(5201920K), 0.0462280 secs] [Times: user=0.33
>>>>>> sys=0.01, real=0.05 secs]
>>>>>> 2012-02-06T09:50:22.658+0100: 328763.849: [GC 328763.849: [ParNew (1:
>>>>>> promotion failure size = 25)  (4: promotion failure size = 25)  (6:
>>>>>> promotion failure size = 25)  (7: promotion failure size = 144)
>>>>>> (promotion failed): 358039K->358358K(368640K), 0.2148680 secs]328764.064: [CMS:
>>>>>> 2854709K->446750K(4833280K), 5.8368270 secs]
>>>>>> 3210572K->446750K(5201920K), [CMS Perm : 124670K->124644K(262144K)],
>>>>>> 6.0525230 secs] [Times: user=6.32 sys=0.00, real=6.05 secs]
>>>>>> 2012-02-06T09:50:29.896+0100: 328771.086: [GC 328771.087: [ParNew:
>>>>>> 327680K->22569K(368640K), 0.0227080 secs] 774430K->469319K(5201920K),
>>>>>> 0.0235020 secs] [Times: user=0.16 sys=0.00, real=0.02 secs]
>>>>>> 2012-02-06T09:50:31.076+0100: 328772.266: [GC 328772.267: [ParNew:
>>>>>> 350249K->22264K(368640K), 0.0235480 secs] 796999K->469014K(5201920K),
>>>>>> 0.0243000 secs] [Times: user=0.18 sys=0.01, real=0.02 secs]
>>>>>> --------------------
>>>>>>
>>>>>> We're going to try to double the new size on a single node, to see the
>>>>>> effects of that.
>>>>>>
>>>>>> Beyond this experiment, is there any additional data I can collect to
>>>>>> better understand the nature of the promotion failures?
>>>>>> Am I at the point where I need to start collecting free list statistics?
>>>>>>
>>>>>> Thanks,
>>>>>> Taras
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> hotspot-gc-use mailing list
>>>>> hotspot-gc-use at openjdk.java.net
>>>>> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>>>>>
>>>>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use

From kbbryant61 at gmail.com  Wed Mar 28 12:33:54 2012
From: kbbryant61 at gmail.com (Kobe Bryant)
Date: Wed, 28 Mar 2012 12:33:54 -0700
Subject: JDK6 YoungGen Layout
Message-ID: <CAMJNr9=EHHdT=RBfW+wg_OGtZNxzybY0tVoO=+oGvc7xURK2Yg@mail.gmail.com>

I am using JDK 1.6. I configured Xmx = 2G, Xms = 2G, NewSize = MaxNewSize =
660m

I enabled verbose GC and see the following output, which I do not understand:


PSYoungGen      total 608256K, used 32440K
 eden space 540672K, 6% used
 from space 67584K, 0% used
 to   space 67584K, 0% used


According to my configuration, the YoungGen size should be 675840K (660m),
because YoungGen = Eden + two survivor spaces, which matches my understanding
of what the YoungGen size should be.

The GC log says that my YoungGen size = 608256K, which is not what I've
configured. If I add the Eden space and the two survivor spaces I get
540672K + 67584K + 67584K = 675840K, which is what I configured.

Eden + 1 survivor space = 540672K + 67584K = 608256K, which is what the GC log
reports. So does this mean that YoungGen = Eden + 1 survivor space? I think
there are two survivor spaces, correct?

Please explain this to me.

thanking you
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120328/97f9e51b/attachment-0001.html 

From ysr1729 at gmail.com  Wed Mar 28 14:26:30 2012
From: ysr1729 at gmail.com (Srinivas Ramakrishna)
Date: Wed, 28 Mar 2012 14:26:30 -0700
Subject: JDK6 YoungGen Layout
In-Reply-To: <CAMJNr9=EHHdT=RBfW+wg_OGtZNxzybY0tVoO=+oGvc7xURK2Yg@mail.gmail.com>
References: <CAMJNr9=EHHdT=RBfW+wg_OGtZNxzybY0tVoO=+oGvc7xURK2Yg@mail.gmail.com>
Message-ID: <CABzyjykLd568dirFUS6SjkxHimRR0NJypEMDUAKqZ8h5_G0MyQ@mail.gmail.com>

On Wed, Mar 28, 2012 at 12:33 PM, Kobe Bryant <kbbryant61 at gmail.com> wrote:

>
>
>   I am using JDK 1.6. I configured Xmx = 2G, Xms = 2G, NewSize = MaxNewSize
> = 660m
>
> I enabled verbose GC and see the following output, which I do not understand:
>
>
> PSYoungGen      total 608256K, used 32440K
>
>  eden space 540672K, 6% used
>
>  from space 67584K, 0% used
>
>  to   space 67584K, 0% used
>
>
>  According to my configuration, the YoungGen size should be 675840K (660m),
> because YoungGen = Eden + two survivor spaces, which matches my understanding
> of what the YoungGen size should be.
>
> The GC log says that my YoungGen size = 608256K, which is not what I've
> configured. If I add the Eden space and the two survivor spaces I get
> 540672K + 67584K + 67584K = 675840K, which is what I configured.
>
> Eden + 1 survivor space = 540672K + 67584K = 608256K, which is what the GC
> log reports. So does this mean that YoungGen = Eden + 1 survivor space? I
> think there are two survivor spaces, correct?
>

You are right. For historical reasons, the JVM reports under "total" for the
young gen only the space that can actually hold objects. Since only one of the
survivor spaces (plus Eden) can hold Java objects at any one time, the message
reports that as the "size" of the young gen.
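
Spelling out the arithmetic with the numbers from your log:

    configured NewSize = MaxNewSize = 660m          = 675840K
    eden + from + to   = 540672K + 67584K + 67584K  = 675840K
    reported "total"   = eden + one survivor
                       = 540672K + 67584K           = 608256K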

Hope that helps clear the confusion. (Ask again if i did not understand yr
question.)
-- ramki


> Please explain me.
>
> thanking you
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120328/30b9d086/attachment-0001.html 

From kbbryant61 at gmail.com  Fri Mar 30 13:32:06 2012
From: kbbryant61 at gmail.com (Kobe Bryant)
Date: Fri, 30 Mar 2012 13:32:06 -0700
Subject: Puzzling - why is a Full GC triggered here?
Message-ID: <CAMJNr9m58YZeOy=8EnTGzO17paYjSxP7QixmtmCAo3Oo3WFHDA@mail.gmail.com>

The following GC trace was obtained soon after the JVM was started
(a lot of transient objects are generated as a part of the start up and
hence
a lot of minor GCs).

( Xms = Xmx = 2G, NewGen = 1/3 of 2G, PermGen = 760m)

I see Full GC triggered (highlighted in bold below): the object space
occupancy
in tenured space is 2% - and yet a Full GC is triggered (I have elided
PermGen reports, as PermGen usage is about 2% and is healthy)

{Heap before GC invocations=2 (full 1):
 PSYoungGen      total 608256K, used 2016K [0x00000007d6c00000,
0x0000000800000000, 0x0000000800000000)
  eden space 540672K, 0% used
[0x00000007d6c00000,0x00000007d6c00000,0x00000007f7c00000)
  from space 67584K, 2% used
[0x00000007f7c00000,0x00000007f7df8030,0x00000007fbe00000)
  to   space 67584K, 0% used
[0x00000007fbe00000,0x00000007fbe00000,0x0000000800000000)
 PSOldGen        total 1421312K, used 0K [0x0000000780000000,
0x00000007d6c00000, 0x00000007d6c00000)
  object space 1421312K, 0% used
[0x0000000780000000,0x0000000780000000,0x00000007d6c00000)
 PSPermGen       total 524288K, used 10939K [0x0000000750400000,
0x0000000770400000, 0x0000000780000000)
  object space 524288K, 2% used
[0x0000000750400000,0x0000000750eaef28,0x0000000770400000)
*2012-03-26T22:35:49.121-0700: 0.614: [Full GC (System)AdaptiveSizeStart:
0.645 collection: 2*
AdaptiveSizeStop: collection: 2
 [PSYoungGen: 2016K->0K(608256K)] [PSOldGen: 0K->1809K(1421312K)]
2016K->1809K(2029568K) [PSPermGen: 10939K->10939K(524288K)], 0.0317230
secs] [Times: user=0.03 sys=0.00, real=0.03 secs]

Why was a full GC triggered here? Is it because of RMI DGC? Is there a way to
identify whether DGC is triggering GCs?
PermGen is at a healthy and consistent level of 2% (hence the elided detail
above), so it cannot be the cause of the full GCs.

thanking you
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120330/502e7d2e/attachment.html 

From jon.masamitsu at oracle.com  Fri Mar 30 13:57:49 2012
From: jon.masamitsu at oracle.com (Jon Masamitsu)
Date: Fri, 30 Mar 2012 13:57:49 -0700
Subject: Puzzling - why is a Full GC triggered here?
In-Reply-To: <CAMJNr9m58YZeOy=8EnTGzO17paYjSxP7QixmtmCAo3Oo3WFHDA@mail.gmail.com>
References: <CAMJNr9m58YZeOy=8EnTGzO17paYjSxP7QixmtmCAo3Oo3WFHDA@mail.gmail.com>
Message-ID: <4F761E4D.7040103@oracle.com>

Is this the full GC in question?

*2012-03-26T22:35:49.121-0700: 0.614: [Full GC (System)AdaptiveSizeStart:
0.645 collection: 2*

This one is caused by an explicit call to System.gc().  I don't know if it
has to do with RMI / distributed GC.  There are flags to change the interval
between the explicit GCs that RMI requests.  You could change them and see if
this goes away.

-Dsun.rmi.dgc.client.gcInterval=3600000
-Dsun.rmi.dgc.server.gcInterval=3600000
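
For example, something along these lines (the one-hour interval and the class
name are placeholders, not a recommendation):

    java -Dsun.rmi.dgc.client.gcInterval=3600000 \
         -Dsun.rmi.dgc.server.gcInterval=3600000 \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         YourMainClass

If the "(System)" full GCs then line up with that interval, RMI DGC is the
likely caller.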


On 3/30/2012 1:32 PM, Kobe Bryant wrote:
> The following GC trace was obtained soon after the JVM was started
> (a lot of transient objects are generated as a part of the start up and
> hence
> a lot of minor GCs).
>
> ( Xms = Xmx = 2G, NewGen = 1/3 of 2G, PermGen = 760m)
>
> I see Full GC triggered (highlighted in bold below): the object space
> occupancy
> in tenured space is 2% - and yet a Full GC is triggered (I have elided
> PermGen reports, as PermGen usage is about 2% and is healthy)
>
> {Heap before GC invocations=2 (full 1):
>   PSYoungGen      total 608256K, used 2016K [0x00000007d6c00000,
> 0x0000000800000000, 0x0000000800000000)
>    eden space 540672K, 0% used
> [0x00000007d6c00000,0x00000007d6c00000,0x00000007f7c00000)
>    from space 67584K, 2% used
> [0x00000007f7c00000,0x00000007f7df8030,0x00000007fbe00000)
>    to   space 67584K, 0% used
> [0x00000007fbe00000,0x00000007fbe00000,0x0000000800000000)
>   PSOldGen        total 1421312K, used 0K [0x0000000780000000,
> 0x00000007d6c00000, 0x00000007d6c00000)
>    object space 1421312K, 0% used
> [0x0000000780000000,0x0000000780000000,0x00000007d6c00000)
>   PSPermGen       total 524288K, used 10939K [0x0000000750400000,
> 0x0000000770400000, 0x0000000780000000)
>    object space 524288K, 2% used
> [0x0000000750400000,0x0000000750eaef28,0x0000000770400000)
> *2012-03-26T22:35:49.121-0700: 0.614: [Full GC (System)AdaptiveSizeStart:
> 0.645 collection: 2*
> AdaptiveSizeStop: collection: 2
>   [PSYoungGen: 2016K->0K(608256K)] [PSOldGen: 0K->1809K(1421312K)]
> 2016K->1809K(2029568K) [PSPermGen: 10939K->10939K(524288K)], 0.0317230
> secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
>
> Why was full GC triggered here? Is it because RMI DGC? Is there a way to
> identify if DGC is triggering GCs?
> PermGen is at a healthy and consistent level of 2%. I have hence elided
> this detail above. And hence,
> this cannot be the cause of full GCs.
>
> thanking you
>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120330/793eb319/attachment.html 

From rednaxelafx at gmail.com  Fri Mar 30 19:14:04 2012
From: rednaxelafx at gmail.com (Krystal Mok)
Date: Sat, 31 Mar 2012 10:14:04 +0800
Subject: Puzzling - why is a Full GC triggered here?
In-Reply-To: <CAMJNr9m58YZeOy=8EnTGzO17paYjSxP7QixmtmCAo3Oo3WFHDA@mail.gmail.com>
References: <CAMJNr9m58YZeOy=8EnTGzO17paYjSxP7QixmtmCAo3Oo3WFHDA@mail.gmail.com>
Message-ID: <CA+cQ+tRd8YDmREq8bpv6b_Hj6Uy3H0mVS32JAd5_gVAxzTjtPQ@mail.gmail.com>

Hi,

I had a BTrace script that could trace System.gc() calls and print the
stack trace [1]. If you run the script with a BTrace agent from the start
of your program, you might be able to track down the caller [2].

HTH,
- Kris

[1]:
https://gist.github.com/2000950/37b0095b1edbb7d4a43cc5b39cbe148c8184d3aa#file_trace_system_gc_call.java
[2]: http://kenai.com/projects/btrace/pages/UserGuide
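
In case the links go stale: the core of such a script is only a few lines.
This is a from-memory sketch against the old com.sun.btrace API, so it may not
match the gist exactly:

    import com.sun.btrace.annotations.BTrace;
    import com.sun.btrace.annotations.OnMethod;
    import static com.sun.btrace.BTraceUtils.jstack;
    import static com.sun.btrace.BTraceUtils.println;

    @BTrace
    public class TraceSystemGC {
        // Fires on entry to java.lang.System.gc() and dumps the caller's stack.
        @OnMethod(clazz = "java.lang.System", method = "gc")
        public static void onSystemGc() {
            println("System.gc() called from:");
            jstack();
        }
    }

Run it with something like "btrace <pid> TraceSystemGC.java", or attach the
BTrace agent at JVM startup as described in the user guide.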

On Sat, Mar 31, 2012 at 4:32 AM, Kobe Bryant <kbbryant61 at gmail.com> wrote:

> The following GC trace was obtained soon after the JVM was started
> (a lot of transient objects are generated as a part of the start up and
> hence
> a lot of minor GCs).
>
> ( Xms = Xmx = 2G, NewGen = 1/3 of 2G, PermGen = 760m)
>
> I see Full GC triggered (highlighted in bold below): the object space
> occupancy
> in tenured space is 2% - and yet a Full GC is triggered (I have elided
> PermGen reports, as PermGen usage is about 2% and is healthy)
>
> {Heap before GC invocations=2 (full 1):
>  PSYoungGen      total 608256K, used 2016K [0x00000007d6c00000,
> 0x0000000800000000, 0x0000000800000000)
>   eden space 540672K, 0% used
> [0x00000007d6c00000,0x00000007d6c00000,0x00000007f7c00000)
>   from space 67584K, 2% used
> [0x00000007f7c00000,0x00000007f7df8030,0x00000007fbe00000)
>   to   space 67584K, 0% used
> [0x00000007fbe00000,0x00000007fbe00000,0x0000000800000000)
>  PSOldGen        total 1421312K, used 0K [0x0000000780000000,
> 0x00000007d6c00000, 0x00000007d6c00000)
>   object space 1421312K, 0% used
> [0x0000000780000000,0x0000000780000000,0x00000007d6c00000)
>  PSPermGen       total 524288K, used 10939K [0x0000000750400000,
> 0x0000000770400000, 0x0000000780000000)
>   object space 524288K, 2% used
> [0x0000000750400000,0x0000000750eaef28,0x0000000770400000)
> *2012-03-26T22:35:49.121-0700: 0.614: [Full GC (System)AdaptiveSizeStart:
> 0.645 collection: 2*
> AdaptiveSizeStop: collection: 2
>  [PSYoungGen: 2016K->0K(608256K)] [PSOldGen: 0K->1809K(1421312K)]
> 2016K->1809K(2029568K) [PSPermGen: 10939K->10939K(524288K)], 0.0317230
> secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
>
> Why was full GC triggered here? Is it because RMI DGC? Is there a way to
> identify if DGC is triggering GCs?
> PermGen is at a healthy and consistent level of 2%. I have hence elided
> this detail above. And hence,
> this cannot be the cause of full GCs.
>
> thanking you
>
>
>
> _______________________________________________
> hotspot-gc-use mailing list
> hotspot-gc-use at openjdk.java.net
> http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/attachments/20120331/7e73b2a4/attachment.html 

From gbowyer at fastmail.co.uk  Sat Mar 31 20:26:50 2012
From: gbowyer at fastmail.co.uk (Greg Bowyer)
Date: Sat, 31 Mar 2012 20:26:50 -0700
Subject: Question about hprof binary heap dump format
Message-ID: <4F77CAFA.2090107@fastmail.co.uk>

Hi all,

This is probably the wrong forum to ask this in, however.

I am writing a simple C program that can take hprof heap dumps (the ones 
from OutOfMemoryError) and extract a few interesting statistics from 
them.

When I dump out the HPROF_HEAP_SUMMARY records, I am surprised to find that 
I get multiple heap summaries. Is this a summary for each region in the heap, 
is it a deprecated record type, or am I missing some deeper truth?

Also, for my 8GB used / 12GB max heap (a ~7.9GB hprof file), I can't 
reconcile the numbers I see for the allocated bytes with what I would expect 
the regions to be. I am assuming that the fields of the heap summary records 
are encoded as big-endian (however, if I treat them as little-endian, I also 
get odd numbers).

Are there any pointers to what I am doing wrong here?

-- Greg

--- %< ---
Heap summary:
     Reachable bytes=0
     Reachable instances=285292896
     Allocated bytes=1986080617
     Allocated instances=1886745683

Heap summary:
     Reachable bytes=5
     Reachable instances=597250776
     Allocated bytes=0
     Allocated instances=0

Heap summary:
     Reachable bytes=0
     Reachable instances=0
     Allocated bytes=0
     Allocated instances=0

Heap summary:
     Reachable bytes=170629377
     Reachable instances=170631681
     Allocated bytes=170637825
     Allocated instances=170644737

Heap summary:
     Reachable bytes=37
     Reachable instances=1
     Allocated bytes=37
     Allocated instances=1
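
For what it's worth, the record framing I'd expect (from my reading of the
format notes that ship with the JDK's hprof agent sources) is: u1 tag, u4
timestamp delta, u4 body length, then the body; for HEAP_SUMMARY (tag 0x07)
the body is u4 live bytes, u4 live instances, u8 allocated bytes, u8 allocated
instances, all big-endian. A small Java sketch that walks the records this way
(treat it as a cross-check of my assumptions, not a reference implementation):

    import java.io.BufferedInputStream;
    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;

    public class DumpHeapSummaries {
        public static void main(String[] args) throws Exception {
            DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(args[0])));
            // Header: null-terminated version string, u4 identifier size, u8 timestamp.
            while (in.readByte() != 0) { /* skip "JAVA PROFILE 1.0.x" */ }
            int idSize = in.readInt();
            in.readLong(); // dump timestamp in ms
            System.out.println("identifier size = " + idSize);
            try {
                while (true) {
                    int tag = in.readUnsignedByte();
                    in.readInt(); // microseconds since the header timestamp
                    long len = in.readInt() & 0xFFFFFFFFL;
                    if (tag == 0x07) { // HPROF_HEAP_SUMMARY
                        // DataInputStream reads big-endian, matching the hprof format.
                        System.out.println("HEAP_SUMMARY:"
                                + " liveBytes=" + (in.readInt() & 0xFFFFFFFFL)
                                + " liveInstances=" + (in.readInt() & 0xFFFFFFFFL)
                                + " allocatedBytes=" + in.readLong()
                                + " allocatedInstances=" + in.readLong());
                    } else {
                        long remaining = len; // heap dump segments can be huge
                        while (remaining > 0) {
                            int chunk = (int) Math.min(remaining, 1 << 30);
                            int skipped = in.skipBytes(chunk);
                            if (skipped <= 0) throw new EOFException("truncated record");
                            remaining -= skipped;
                        }
                    }
                }
            } catch (EOFException endOfDump) {
                // normal end of file
            } finally {
                in.close();
            }
        }
    }

If the summaries still look odd when read that way, two easy things to check
are whether each record body is skipped by its full length and whether the u4
fields are being read as signed values.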