Reducing CMS-remark times

Thu Sep 11 19:25:20 UTC 2008

Hi Justin --

> Here's my current trains of thought:
> 
> a) My load generator is not very close to real-world load.

Probably :)

> b) There are some OS level tunables that need set on the v490

I don't think those would account for the difference. I know of no
specific OS tunables that apply in this case.

> c) There is a bug in 1.4.2_17 that's biting me.

Unlikely.

> d) I'm not getting concurrency in the remark phase (which would
> explain my dual core laptop keeping up with my 8 core server)

The remark phase has never been "concurrent" (at least not in the sense
of being concurrent with mutators); instead it's "stop-the-world".
The application threads are all stopped while the remark phase proceeds,
which is why long remark pauses hurt a lot.

> d) I'm running into what Jon is describing here:
> http://blogs.sun.com/jonthecollector/entry/did_you_know regarding
> CMSMaxAbortablePrecleanTime.  I have no idea how I can resolve this on
> 1.4.2 if that's the case.  Perhaps shrink my New???

Right. The best way to see evidence of this is to use -XX:PrintCMSStatistics=1.
You will, in the remark phase, see details on how much time is spent in
which worker thread, and you'll notice that one of the threads takes extremely
long. This is the thread scanning a large monolithic Eden, which becomes
a serial bottleneck for this phase.

The parallelization of Eden scanning was implemented in 5.0, so
your best bet is to upgrade to a newer JVM if possible and, failing
that, to use a smaller young gen.

Another optional tuning switch introduced in 5uXX was
-XX:+CMSScavengeBeforeRemark, which does a scavenge that empties Eden
immediately before doing a remark. This makes what was the critical task
here into a zero-length task, and converts that work into more dirty cards
in the old gen, which are scanned in parallel, leading to a reduction in
the pause time. Unfortunately, that switch also first appeared in 5uXX.

All the extra processors on your v490 cannot help with a large serial task
in this case (1.4.2_XX).
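
To make the serial-bottleneck point concrete, here is a rough sketch of the
arithmetic. It is a simplified model of my own, not how HotSpot actually
schedules remark work, and the class name, method name and all the timings
in it are hypothetical: the pause can be no shorter than its longest single
task, however many worker threads there are.

```java
// Simplified model (an assumption for illustration, not HotSpot's scheduler):
// the remark pause cannot be shorter than its longest single task, nor
// shorter than the total work divided evenly among the GC worker threads.
class RemarkPauseModel {
    static double lowerBoundSecs(double edenTaskSecs, double otherWorkSecs, int workers) {
        double evenSplit = (edenTaskSecs + otherWorkSecs) / workers;
        return Math.max(edenTaskSecs, evenSplit);
    }

    public static void main(String[] args) {
        // Hypothetical figures: 1.0s to scan Eden in one thread, 0.4s of other work.
        double eden = 1.0, other = 0.4;
        for (int workers : new int[] {2, 8}) {
            System.out.println(workers + " workers: >= "
                + lowerBoundSecs(eden, other, workers) + " s");
        }
        // If Eden is emptied first, its task vanishes and (hypothetically) some
        // of that work reappears as dirty cards that are scanned in parallel.
        System.out.println("8 workers, empty Eden: >= "
            + lowerBoundSecs(0.0, 0.6, 8) + " s");
    }
}
```

With a monolithic Eden task the bound is the same for 2 and for 8 workers,
which is the effect described above; only removing or splitting that task
lets the extra processors contribute.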

That said, comparing the remark phases in the two snippets you give also
shows that your production system sees a large "weak reference"
processing time, which is not the case on your laptop.
I suspect that might be the result of a difference between your production
load and your synthetic load on the laptop. This again is a serial phase
by default, so the extra cpus on the v490 do not help with it.
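
If it helps to track this split over many remark pauses, a small parser
along the following lines could pull the rescan and weak-reference times out
of each line. This is only a sketch: the class name and regular expressions
are my own, they assume the 1.4.2 log format quoted at the end of this
message, and they expect each log entry to be unwrapped onto a single line
first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RemarkBreakdown {
    // Patterns follow the 1.4.2 log lines quoted in this thread; other JVM
    // versions print different formats, so treat these as assumptions.
    private static final Pattern RESCAN =
        Pattern.compile("\\[Rescan \\(parallel\\) ?, ([0-9.]+)\\s+secs\\]");
    private static final Pattern WEAK_REFS =
        Pattern.compile("\\[weak refs processing, ([0-9.]+)\\s+secs\\]");
    private static final Pattern TOTAL =
        Pattern.compile(", ([0-9.]+)\\s+secs\\]\\s*$");

    private static double secs(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("not a CMS-remark line: " + line);
        }
        return Double.parseDouble(m.group(1));
    }

    /** Returns {rescan, weak refs, total} in seconds for one unwrapped log line. */
    static double[] parse(String line) {
        return new double[] { secs(RESCAN, line), secs(WEAK_REFS, line), secs(TOTAL, line) };
    }

    public static void main(String[] args) {
        String production = "18999.700: [GC18999.700: [Rescan (parallel) , 1.0845970 secs]"
            + "19000.785: [weak refs processing, 1.0721075 secs] [1 CMS-remark: "
            + "1610403K(1966080K)] 1894660K(2310144K), 2.1578777 secs]";
        double[] t = parse(production);
        System.out.println("rescan share:    " + Math.round(100 * t[0] / t[2]) + "%");
        System.out.println("weak refs share: " + Math.round(100 * t[1] / t[2]) + "%");
    }
}
```

On the two lines you sent, weak reference processing is roughly half of the
production pause but only about 6% of the laptop pause.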

This phase was optimized somewhat in 5uXX (by adding a parallel reference
processing option), and further in 6uXX (by enhancing precleaning of
discovered references). I don't believe either of those optimizations
was ever backported to 1.4.2_XX.

1.4.2_XX is really quite old, and it shows its age on modern
servers with much parallelism.

-- ramki

> Production server:
> ....

>
> 18999.700: [GC18999.700: [Rescan (parallel) , 1.0845970
> secs]19000.785: [weak refs processing, 1.0721075 secs] [1 CMS-remark:
> 1610403K(1966080K)] 1894660K(2310144K), 2.1578777 secs]

> Laptop:

> 12457.363: [GC12457.363: [Rescan (parallel) , 1.9972650
> secs]12459.360: [weak refs processing, 0.1239550 secs] [1 CMS-remark:
> 1620532K(1966080K)] 1941855K(2310144K), 2.1214010 secs]