Does allocation performance vary by collector?

Tony Printezis tony.printezis at oracle.com
Wed Apr 14 18:33:56 UTC 2010


Apart from the fact that the size of G1's regions might cap the TLAB 
sizes, the presence of heap regions also introduces a second slow path 
into the allocation code. When a thread tries to allocate a new TLAB but 
the current allocation region is full, it needs to "retire" the current 
region and grab a new one.
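
In (very rough) pseudo-Java, the extra step looks something like this. 
This is a conceptual sketch only -- the real code is C++ inside the VM 
and all the names below are made up:

  class RegionAllocSketch {
      static class Region {
          long top;           // next free address in this region
          final long end;     // region limit
          Region(long start, long size) { this.top = start; this.end = start + size; }
          long tryAllocate(long size) {          // bump-the-pointer within the region
              if (top + size > end) return -1;   // region can't hold this TLAB
              long result = top;
              top += size;
              return result;
          }
      }

      Region current = new Region(0, 1 << 20);   // pretend 1m region

      long allocateNewTlab(long tlabSize) {
          long addr = current.tryAllocate(tlabSize);
          if (addr < 0) {                        // the second slow path:
              retire(current);                   // retire the full region and
              current = acquireFreeRegion();     // grab a fresh one, then retry
              addr = current.tryAllocate(tlabSize);
          }
          return addr;
      }

      void retire(Region r) { /* record the full region for later collection */ }
      Region acquireFreeRegion() { return new Region(0, 1 << 20); }
  }

Roughly speaking, the retire/acquire branch is the part the non-regional 
collectors don't have; their TLAB refill just bumps a single shared eden 
pointer.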

To increase the heap region size, use the -XX:G1HeapRegionSize=N 
parameter. Currently, the allowed range is 1m to 32m.

To see what heap region size the JVM has chosen for your app, enable 
-XX:+PrintHeapAtGC, which will show that information.
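
For example (the 16m value below is just an arbitrary illustration, and 
YourApp stands in for whatever you launch):

  java -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC \
       -XX:G1HeapRegionSize=16m -XX:+PrintHeapAtGC ... YourApp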

And I totally agree with Ramki that just showing averages might not be 
very productive; you'd be better off showing distribution statistics.
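
For example, even a very crude percentile dump is more informative than 
a mean. A throwaway sketch, with the operation being measured left as a 
placeholder:

  import java.util.Arrays;

  public class LatencyPercentiles {
      public static void main(String[] args) {
          long[] samples = new long[100000];
          for (int i = 0; i < samples.length; i++) {
              long start = System.nanoTime();
              // ... the operation whose latency you care about goes here ...
              samples[i] = System.nanoTime() - start;
          }
          Arrays.sort(samples);
          for (double p : new double[] {0.50, 0.90, 0.99, 0.999}) {
              int idx = (int) Math.round(p * (samples.length - 1));
              System.out.printf("p%.1f = %d ns%n", p * 100, samples[idx]);
          }
          System.out.println("max    = " + samples[samples.length - 1] + " ns");
      }
  }

The tail percentiles are usually where the differences between collectors 
show up.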

Tony

Y. Srinivas Ramakrishna wrote:
> Hi Matt -- if you are really trying to measure pure allocation throughput
> you might want to completely eliminate GC overhead by making sure
> your instrumentation collects figures over an interval during which
> no GC activity intervenes.
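
One way to check that is the standard GarbageCollectorMXBean API -- a 
rough sketch, with the work being measured left as a placeholder:

  import java.lang.management.GarbageCollectorMXBean;
  import java.lang.management.ManagementFactory;

  public class GcFreeIntervalCheck {
      static long totalGcCount() {
          long count = 0;
          for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
              count += gc.getCollectionCount();
          }
          return count;
      }

      public static void main(String[] args) {
          long gcBefore = totalGcCount();
          long start = System.nanoTime();
          // ... the allocation-heavy work being measured goes here ...
          long elapsed = System.nanoTime() - start;
          if (totalGcCount() == gcBefore) {
              System.out.println("GC-free interval: " + elapsed + " ns");
          } else {
              System.out.println("a GC intervened, discard this sample");
          }
      }
  }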
>
> I would typically expect NUMA (Parallel) to be better than the rest,
> but, just as you stated in your example, certain allocation+use patterns
> could degrade performance. Other than that (and except for G1; see below),
> all the configurations (modulo the GC overhead remarks above) should show
> similar allocation performance.
>
> G1 uses "heap regions", which might somewhat limit TLAB growth and might
> cause slightly lower allocation throughput (but not necessarily; run a
> non-G1 collector with -XX:+PrintTLAB to get some data on TLAB sizes).
> However, you can try to fix that by choosing a larger heap region
> size in G1. (G1 cognoscenti on the list can provide more details.)
>
> However, I see that you are interested not just in allocation performance
> but in the latency of your operations in general (which is why you were
> concerned with GC pause times themselves). In that case, you are right
> that CMS or G1 would probably be superior if you had a heap footprint
> large enough to trigger whole-heap GCs (or at least if enough objects got
> promoted to the old gen to require the occasional full GC). Between G1
> and CMS, G1 generally provides much more regular and predictable GC
> pauses, but for a truly apples-to-apples comparison you cannot assume
> that the optimal heap shape for CMS is the same as that for G1. CMS needs
> hand tuning, while G1 finds something close to optimal on its own (but
> might occasionally need some help) -- thus, paradoxically, the heap shape
> you set for G1 might be suboptimal unless you have been careful,
> especially if you merely took the CMS-optimal setting and used it with G1.
>
> If you can arrange to have nothing ever promoted into the old gen,
> then, provided the lifetimes of objects are not so long that you spend
> effort copying them between survivor spaces, Parallel+NUMA may be best,
> modulo the caveats about NUMA-allocator anti-patterns above.
>
> One final remark: rather than looking at averages or other central
> measures, I'd suggest looking at latency distribution metrics to quickly
> get a handle on what is happening, and on how to tune/configure whichever
> collector you choose for your needs.
>
> -- ramki
>
> On 04/13/10 10:46, Matt Khan wrote:
>> Hi
>>
>> I have been revisiting our jvm configuration with the aim of reducing
>> pause times; it would be nice to be consistently below 3ms. The
>> allocation behaviour of the application in question involves a small
>> amount of static data on startup & then a steady stream of objects
>> that have a relatively short lifespan. There are two typical lifetimes
>> for these objects: about 75% fall into the first group, while the
>> remainder have a mean of maybe 70s, though there is quite a long tail
>> to this, so the typical lifetime is more like <10s. There won't be
>> many such objects alive at once but there are quite a few passing
>> through. The app runs on a 16 core Opteron box running Solaris 10
>> with 6u18.
>>
>> Therefore I've been benching different configurations with a massive
>> eden and a relatively tiny tenured gen & trying different collectors
>> to see how they perform. These params were common to each run:
>>
>> -Xms3072m -Xmx3072m -Xmn2944m -XX:+DisableExplicitGC 
>> -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
>> -XX:+PrintGCApplicationStoppedTime
>> -XX:+PrintGCApplicationConcurrentTime
>> -XX:MaxTenuringThreshold=1 -XX:SurvivorRatio=190 
>> -XX:TargetSurvivorRatio=90
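
(If I have the sizing formula right, those settings give each survivor 
space -Xmn / (SurvivorRatio + 2) = 2944m / 192, i.e. roughly 15.3m, so 
eden is about 2944m - 2 x 15.3m = 2913m -- effectively almost the whole 
young gen.)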
>>
>> I then tried the following:
>>
>> # Parallel Scavenge
>> -XX:+UseParallelGC -XX:+UseParallelOldGC
>> # Parallel Scavenge with NUMA
>> -XX:+UseParallelGC -XX:+UseNUMA -XX:+UseParallelOldGC
>> # Incremental CMS/ParNew
>> -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode
>> -XX:+CMSIncrementalPacing -XX:+UseParNewGC
>> # G1
>> -XX:+UnlockExperimentalVMOptions -XX:+UseG1GC
>>
>> The last two (CMS/G1) were repeated on 6u18 & 6u20b02 for
>> completeness, as I see there were assorted fixes to G1 in 6u20b01.
>>
>> I measure the time it takes to execute assorted points in my flow &
>> see fairly significant differences in latencies with each collector,
>> for example:
>>
>> 1) CMS == ~380-400micros
>> 2) Parallel + NUMA == ~400micros
>> 3) Parallel == ~450micros
>> 4) G1 == ~550micros
>>
>> The times above are taken well after the jvm has warmed up (latencies
>> have stabilised, compilation activity is practically non-existent) &
>> there is no significant "other" activity on the server at the time.
>> The differences don't appear to be pause related, as the shape of the
>> distribution (around those averages) is the same; it's as if it has
>> settled into quite a different steady-state performance. This appears
>> to be repeatable, though, given the time it takes to run this sort of
>> benchmark, I admit to only having seen it repeated a few times. I have
>> run previous benchmarks where it repeats 20 times (keeping GC constant
>> in that case; I was testing something else) without seeing variations
>> that big across runs, which makes me suspect the collection algorithm
>> as the culprit.
>>
>> So the point of this relatively long setup is to ask whether there
>> are theoretical reasons why the choice of garbage collection algorithm
>> should vary measured latency like this. I had been working on the
>> assumption that eden allocation is a "bump the pointer as you take it
>> from a TLAB" type of event, hence generally cheap, & doesn't really
>> vary by algorithm.
>>
>> fwiw the ParNew/CMS config is still the best one for keeping down
>> pause times, though the parallel one was close. The former peaks at
>> intermittent pauses of 20-30ms, the latter at about 40ms. The
>> Parallel + NUMA one curiously involved many fewer pauses, such that
>> much less time was spent paused, but peaked higher (~120ms), which is
>> really unacceptable. I don't really understand why that is, but I
>> speculated that it's down to the fact that one of our key domain
>> objects is allocated in a different thread than the one where it is
>> primarily used. Is this right?
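
As I understand it, with -XX:+UseNUMA eden is split into per-node areas 
and a new object is placed in memory local to the allocating thread, so 
the pattern described above looks roughly like the contrived sketch 
below (the class and queue here are made up): the producer thread 
allocates, and a consumer running on a different node pays remote-memory 
latency on every read.

  import java.util.concurrent.ArrayBlockingQueue;
  import java.util.concurrent.BlockingQueue;

  public class CrossThreadAllocation {
      static class Order { final long[] payload = new long[64]; }  // stand-in domain object

      public static void main(String[] args) throws Exception {
          final BlockingQueue<Order> queue = new ArrayBlockingQueue<Order>(1024);

          // Producer thread allocates: with -XX:+UseNUMA the new objects
          // land in the memory node local to *this* thread.
          Thread producer = new Thread(new Runnable() {
              public void run() {
                  try {
                      while (true) queue.put(new Order());
                  } catch (InterruptedException e) { /* exit */ }
              }
          });
          producer.setDaemon(true);
          producer.start();

          // The main thread uses them: if it runs on a different node it
          // reads the payload across the interconnect.
          long sum = 0;
          for (int i = 0; i < 1000000; i++) {
              Order o = queue.take();
              for (long v : o.payload) sum += v;
          }
          System.out.println(sum);
      }
  }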
>>
>> If there is some other data that I should post to back up some of the
>> above, then please tell me and I'll add the info if I have it (and
>> repeat the test if I don't).
>> Cheers
>> Matt
>>
>> Matt Khan
>> --------------------------------------------------
>> GFFX Auto Trading
>> Deutsche Bank, London
>>
>>
>
>


