JVM64bit running on Linux 64bit: when system time changes the JVM may hang (bug_id=6900441)

bruno bossola bbossola at gmail.com
Mon Sep 2 07:20:06 PDT 2013


I cannot see this as a kernel bug, sorry, as the kernel is (now?) behaving
as expected. It looks like, for various reasons, the JVM is not using a
monotonic clock for nanosecond waits, and under certain conditions a lot of
primitive functions become unreliable.
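
To make the arithmetic concrete, here is a minimal sketch of why a relative
timeout that is converted into an absolute deadline on the wall clock can
stretch when the clock is set back, while the same deadline on a monotonic
clock does not. It only simulates clock readings with plain numbers; the
class name DeadlineSketch, the method remainingNanos and all the values are
hypothetical and are not taken from the JVM sources.

```java
public class DeadlineSketch {

    // Time still left to wait, given an absolute deadline and the current
    // reading of the same clock (both in nanoseconds).
    static long remainingNanos(long deadline, long now) {
        return Math.max(0L, deadline - now);
    }

    public static void main(String[] args) {
        long timeout = 1_000_000_000L;          // requested relative wait: 1 second

        // Hypothetical clock readings at the moment the wait starts.
        long wallNow = 10_000_000_000_000L;     // wall clock (can be stepped)
        long monoNow = 5_000_000_000L;          // monotonic clock (cannot be stepped)

        // The relative timeout becomes an absolute deadline on each clock.
        long wallDeadline = wallNow + timeout;
        long monoDeadline = monoNow + timeout;

        // Simulate an administrator setting the wall clock back by one hour.
        long oneHour = 3_600L * 1_000_000_000L;
        wallNow -= oneHour;

        // Wall clock deadline is now one hour and one second away.
        System.out.println(remainingNanos(wallDeadline, wallNow)); // 3601000000000
        // Monotonic deadline is still one second away.
        System.out.println(remainingNanos(monoDeadline, monoNow)); // 1000000000
    }
}
```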

For example, this function:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/locks/LockSupport.html#parkNanos%28long%29
becomes unreliable, and because of that the whole concurrency library stops
working.
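
One way to observe this from Java, as a rough sketch rather than a fix, is
to compare the requested park time with the time that actually elapsed
according to System.nanoTime(). The class name ParkCheck and the helper
parkAndMeasure are hypothetical. Note that parkNanos is allowed to return
early, so only a much longer elapsed time is interesting here.

```java
import java.util.concurrent.locks.LockSupport;

public class ParkCheck {

    // Parks the current thread for up to the given number of nanoseconds and
    // returns how long the call actually took according to System.nanoTime().
    static long parkAndMeasure(long nanos) {
        long start = System.nanoTime();
        LockSupport.parkNanos(nanos);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long requested = 1_000_000_000L; // 1 second
        for (int i = 0; i < 5; i++) {
            long elapsed = parkAndMeasure(requested);
            // On an affected system, changing the clock while this runs
            // should show up as an elapsed time far above the request.
            System.out.println("requested=" + requested + "ns elapsed=" + elapsed + "ns");
        }
    }
}
```

Running it while setting the system clock back should make the difference
between the two numbers visible on an affected setup.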

Cheers,

    Bruno


On Mon, Sep 2, 2013 at 3:01 PM, Andrew Haley <aph at redhat.com> wrote:

> Hi,
>
> Yabbut, we really can't rely on a kernel bug to get correct behaviour.
> If the kernel is doing what it's supposed to do then this behaviour is
> exactly what should happen.
>
> Andrew.
>
>
> On 09/02/2013 01:59 PM, David Holmes wrote:
> > Hi Bruno,
> >
> > As you note, this is a very old issue. The reason it hasn't become a
> > priority to fix is that it didn't actually manifest. In theory it
> > should but in practice some "incorrect" clock handling in the kernel
> > made everything work okay. Jump forward to now and we have already seen
> > reports where this has become a problem on 64-bit but still works okay
> > on 32-bit - which is very puzzling as in theory there should be no
> > difference. My own thoughts are that something has been "fixed" in the
> > 64-bit linux kernel and that this now exposes this issue where
> > previously it did not.
> >
> > The basic sleep/wait/park with relative timeouts all use the same
> > underlying mechanism on linux: pthread_cond_timedwait. This takes an
> > absolute time which is currently based on CLOCK_REALTIME. So in theory
> > if the clock is set forward the waits will complete earlier; and if set
> > back they will complete later. But note this is not what was observed in
> > practice.
> >
> > The fix is quite straight-forward, assuming the kernel does the right
> > thing - and that is to use pthread_cond_t associated with
> > CLOCK_MONOTONIC, or even better CLOCK_MONOTONIC_RAW.
> >
> > But there is a complexity in the park code because that API allows both
> > relative and absolute timeouts and for the absolute case we would have
> > to use a different condition variable to wait on (one using
> > CLOCK_REALTIME as it should be affected by changes to the clock!).
> >
> > I can raise the priority of this but a fix for 8 may not be feasible
> > given the current state of things.
> >
> > David Holmes
> >
> > On 2/09/2013 9:41 PM, bruno bossola wrote:
> >> Hi all,
> >>
> >> I am posting here after a few message exchanges on the LJC mailing list;
> >> this is from the 7u lead:
> >>
> >> ===================
> >> Looks like an old/known issue. I've seen varying reports around whether
> >> this is a linux kernel issue or jvm issue.
> >> I'd suggest that Bruno follows up with a question on the
> >> hotspot-runtime-dev at openjdk.java.net mailing list [...]
> >> ====================
> >>
> >> These days my teams are hitting a bug on the 64-bit JVM on 64-bit
> >> Linux: "...there is bug in JVM for overall scheduling during System time
> >> changes backward, which also impacts very basic Object.wait &
> >> Thread.sleep methods. It becomes too risky to keep Java App running when
> >> system time switches back by even certain seconds. You never know what
> >> your Java App will end up to." (source: stackoverflow.com)
> >>
> >> These are some of the consequences:
> >> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7139684
> >> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6311057
> >>
> >> The original bug is private, but I was told it's a P4 that unfortunately
> >> is not being looked after and simply gets shifted from one release to
> >> the next:
> >> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6900441
> >>
> >> See also here for a stackoverflow drill:
> >>
> >> http://stackoverflow.com/questions/9044423/java-scheduler-which-is-completely-independent-of-system-time-changes
> >>
> >> This bug is NOT fixed in the latest JVM, so the recommended course of
> >> action is to restart the VM if a big time jump happens (on small jumps
> >> the JVM will catch up). This happens consistently on a 64-bit VM
> >> running on a 64-bit Linux system, apparently regardless of the
> >> monotonicity of the underlying OS.
> >>
> >> Note that this should not happen for code built on primitives such as
> >> System.nanoTime() (like the queue used internally by ScheduledExecutor),
> >> which should work correctly in the presence of a monotonic clock:
> >>
> >> jlong os::javaTimeNanos() {
> >>    if (Linux::supports_monotonic_clock()) {
> >>      struct timespec tp;
> >>      int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
> >>      assert(status == 0, "gettime error");
> >>      jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) + jlong(tp.tv_nsec);
> >>      return result;
> >>    } else {
> >>      timeval time;
> >>      int status = gettimeofday(&time, NULL);
> >>      assert(status != -1, "linux error");
> >>      jlong usecs = jlong(time.tv_sec) * (1000 * 1000) + jlong(time.tv_usec);
> >>      return 1000 * usecs;
> >>    }
> >> }
> >>
> >> Unfortunately, for some reason, this is not the case on a 1.6+ 64-bit VM
> >> on 64-bit Linux. To make the issue, and its effect on the concurrency
> >> library, clearer, let me introduce this very simple program:
> >>
> >> import java.util.concurrent.locks.LockSupport;
> >>
> >> public class Main {
> >>
> >>      public static void main(String[] args) {
> >>
> >>          for (int i=100; i>0; i--) {
> >>              System.out.println(i);
> >>              LockSupport.parkNanos(1000L*1000L*1000L);
> >>          }
> >>
> >>          System.out.println("Done!");
> >>      }
> >> }
> >>
> >> While running it with a 64-bit 1.6+ JVM on 64-bit Linux, turn the clock
> >> back one hour and wait until the counter stops... magic!  I tested this
> >> on JDK6, JDK7 and the latest JDK8 beta running on various Ubuntu distros.
> >> It's not just a matter of the (old?) sleep() and wait() primitives; it
> >> also affects the new concurrency library. Please note that classic
> >> sleep() works correctly on JDK1.4: to me this qualifies the bug as a
> >> regression, and the fact that it has been there for at least 7 years
> >> troubles me.
> >>
> >> This is something we cannot easily manage, as our software is installed
> >> on our customers' premises, hence we have no control at all over time
> >> changes: if our application hangs, we are in big trouble.
> >>
> >> I'd really like to get your view on the matter.
> >>
> >> Thanks in advance,
> >>
> >>      Bruno
> >>
> >>
>
>