Troubleshooting a ~40-second minor collection

Caspole, Eric Eric.Caspole at amd.com
Mon Dec 2 15:19:32 PST 2013


If you know when the problem is happening and you are logged into the machine, you can try running "perf top" or "perf record" during that window to see a kernel-level profile of what is going on. I think you would also want to do the same when the problem is not happening, so you have a baseline to compare against.
dstat is also handy in these situations for watching CPU, I/O, interrupts, etc. over time:
 http://dag.wiee.rs/home-made/dstat/
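
For example (the PID and the 30-second window below are just placeholders; adjust them to your setup):

    # live kernel profile while the pause is happening
    perf top

    # or record ~30s of samples (with call graphs) from the JVM process for later analysis
    perf record -g -p <java-pid> -- sleep 30
    perf report

    # CPU, disk, net, interrupts and context switches, one line per second
    dstat -tcdnyi 1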

Regards,
Eric


-----Original Message-----
From: hotspot-gc-use-bounces at openjdk.java.net [mailto:hotspot-gc-use-bounces at openjdk.java.net] On Behalf Of Bernd Eckenfels
Sent: Monday, December 02, 2013 5:53 PM
To: hotspot-gc-use at openjdk.java.net
Subject: Re: Troubleshooting a ~40-second minor collection

Hello,

Hmm, switching NUMA off at Linux boot only means that the kernel does no NUMA optimization; the hardware is still NUMA, so the system most likely behaves worse because there is no preference for local memory regions. Maybe it would help to bind the Java process to only one node (that node's CPUs and memory). This is especially good if some other workload (another JVM or a DB) can be bound to the other node.
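
As a rough sketch of what I mean by binding (assuming numactl is installed; the java and tar command lines are placeholders):

    # run the JVM on node 0's CPUs and memory only
    numactl --cpunodebind=0 --membind=0 java <your usual flags> ...

    # and keep the copy job on the other node
    numactl --cpunodebind=1 --membind=1 tar ...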


As for your question about what to look at: I would check for a large number of hardware interrupts (hi / %irq) or context switches (compared to idle time and %soft interrupt time); I am not sure there is an easy way to see whether interrupt optimizations are active or needed by the drivers (mpstat -P ALL, vmstat, /proc/interrupts). I haven't been close to the hardware lately, but I would say more than about 2k context switches per second is something to watch more closely.
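
Something along these lines (the 1-second interval is arbitrary):

    # per-CPU breakdown of %irq and %soft, once per second
    mpstat -P ALL 1

    # the 'in' (interrupts) and 'cs' (context switches) columns, per second
    vmstat 1

    # which devices the interrupts come from, and which CPUs handle them
    cat /proc/interrupts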

For network cards, for example, ethtool can be used to tune interrupt behavior (see for example http://serverfault.com/questions/241421/napi-vs-adaptive-interrupts). But I guess it only becomes a problem when you have multiple gigabit (or faster) interfaces.
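
For example (eth0 is a placeholder, and whether adaptive coalescing is supported depends on the driver):

    # show the current interrupt coalescing settings
    ethtool -c eth0

    # try adaptive receive interrupt coalescing, if the driver supports it
    ethtool -C eth0 adaptive-rx on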

Regards
Bernd

On 02.12.2013 at 23:26, Aaron Daubman <daubman at gmail.com> wrote:

> Hi Bernd,
>
> Thanks for the info.
> This is a NUMA machine; however, as part of setting up hugepages, I
> have disabled NUMA (numa=off) in grub.conf (and have also disabled
> transparent huge page support).
>
> The JVM process is the only significant process (aside from the 
> high-rate data copy tar/nc/pigz) running on this 32-core, 2-node 64G 
> RAM box. The tar process is limited to one CPU (which it keeps close
> to 100% busy), leaving the other 31 free for the JVM; load average on
> the box is fairly low.
>
> The JVM process is spread fairly evenly over the nodes - watching htop 
> I can see CPU jumping around among the 32 cores.
>
> Do you know what I might look at to see network/disk driver misbehavior?
>
> Thanks!
>     Aaron
>
>
> On Mon, Dec 2, 2013 at 5:21 PM, Bernd Eckenfels
> <bernd-2013 at eckenfels.net>wrote:
>
>> Hello Aaron,
>>
>> another rough guess is that when you copy at a high rate you get a
>> lot of system interrupt time and context switches (especially when
>> the network or disk drivers are misbehaving).
>>
>> I wonder whether this can really slow down the GC so much, but it would
>> be the next thing I would investigate.
>>
>> Is this a NUMA machine? Is the JVM process spread over multiple nodes?
>>
>> Regards
>> Bernd
>>
>>
>> On 02.12.2013 at 23:14, Aaron Daubman <daubman at gmail.com> wrote:
>>
>>


--
http://bernd.eckenfels.net
_______________________________________________
hotspot-gc-use mailing list
hotspot-gc-use at openjdk.java.net
http://mail.openjdk.java.net/mailman/listinfo/hotspot-gc-use



