Solaris specific bug in GC log rotation - extremely long pause
Rainer Jung
rainer.jung at kippdata.de
Tue Nov 10 15:45:49 UTC 2015
There's a bug in the Solaris specific HotSpot code which exhibits itself
as extremely long pauses directly after GC log rotation.
The bug can be seen in 1.7.0 patchlevel 80 and above, 1.8.0 patchlevel
20 and above and 1.9.0. It was noticed using
-XX:PrintGCApplicationStoppedTime.
During GC log rotation the function os::print_memory_info() in
src/os/solaris/vm/os_solaris.cpp is called, which itself calls
check_addr0().
Now check_addr0() reads the file /proc/self/pmap, but does this entry by
entry. Productive Java processes often have 1000 and more map entries
(e.g. each thread stack is one entry, often each jar etc.
Reading one entry takes about 35-40ms in a zone on a T4 system. Reading
e.g. 1000 entries takes 35-40 seconds during which the JVM is stopped.
We experienced even stop times up until 2400 seconds in extreme cases.
Note that this doesn't have to do with GC. It only happens directly
after GC log file rotation.
The bug is reading the file entry by entry instead of using pread() to
read it in one go. Look for instance at the source code of the Solaris
command "pmap" which uses pread() to read the file at once.
It might also be questioned, whether this map file check must happen
during a safepoint or could also be done outside, only synchronizing in
case a hit was found and a warning actually needs to be written to the
log file.
The bug was shadowed by another bug until and including 1.7.0_76 and
1.8.0_11: the close() that closes the map file after reading the entries
was erroneously put inside the read loop instead of after the read loop.
Thus these versions only read the first entry, then closed the file and
erred out when trying to read the next entry. So reading was quick.
The responsible change was:
Java 8:
http://hg.openjdk.java.net/jdk8u/jdk8u/hotspot/diff/7d28f4e15b61/src/os/solaris/vm/os_solaris.cpp
Java 7:
http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/diff/e50eb3195734/src/os/solaris/vm/os_solaris.cpp
Example code for reading using pread() (going back to OpenSolaris):
https://github.com/joyent/illumos-joyent/blob/master/usr/src/cmd/ptools/pmap/pmap.c
Look e.g. at function xmapping_iter() and there look for pread().
See also the discussion thread starting at
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2015-November/002351.html.
The analysis was done using source code review and realtime checks with
pstack and truss. It would be nice if you could instruct me how to get
this fixed. It is a bit specific because it only happens on Solaris and
when GC log rotation is active, but Solaris should be an important
platform for Oracle and GC log rotation will be used more and more.
I do *not* have a support contract.
Regards,
Rainer
More information about the hotspot-gc-dev
mailing list