JVM64bit running on Linux 64bit: when system time changes the JVM may hang (bug_id=6900441)
David Holmes
david.holmes at oracle.com
Mon Sep 2 15:13:23 PDT 2013
On 3/09/2013 12:02 AM, bruno bossola wrote:
> Hi David,
>
> thanks for your answer, it clarifies the matter.
>
>
> > My own thoughts are that something has been "fixed" in the 64-bit
> > linux kernel and that this now exposes this issue where previously
> > it did not.
> >
> If I recall the matter correctly, the Linux kernel was not always
> behaving, and for that reason the JVM was double-checking outside using
> gettimeofday. This "fix" is now affecting the correct behaviour.
No I don't think so. There have been a lot of bugs in this area. One
issue we tried to address was the "early return from sleep/wait", by
checking actual elapsed time rather than trusting the timed routines.
But that only dealt with forward time jumps.
As I said in my other email, I don't yet know what exactly has changed in
the 64-bit kernel to expose this issue.
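
For illustration, the "check actual elapsed time" idea mentioned above could look roughly like the following. This is a minimal sketch, not the HotSpot code: wait_at_least and read_clock_nanos are hypothetical names, and measuring elapsed time against CLOCK_MONOTONIC is an assumption of the sketch. The wait itself still uses a CLOCK_REALTIME deadline, so a forward clock jump only causes an early wakeup that the loop catches, while a backward jump can still make the wait run long.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define NANOS_PER_SEC 1000000000LL

/* Read the given clock as a single nanosecond count. */
static long long read_clock_nanos(clockid_t clock) {
    struct timespec ts;
    int status = clock_gettime(clock, &ts);
    assert(status == 0);
    (void)status;
    return (long long)ts.tv_sec * NANOS_PER_SEC + ts.tv_nsec;
}

/*
 * Hypothetical helper: block for at least `nanos` nanoseconds and return
 * the elapsed time actually measured on CLOCK_MONOTONIC.
 */
static long long wait_at_least(long long nanos) {
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_cond_t cond = PTHREAD_COND_INITIALIZER; /* CLOCK_REALTIME by default */
    long long start = read_clock_nanos(CLOCK_MONOTONIC);
    long long elapsed = 0;

    pthread_mutex_lock(&mutex);
    while ((elapsed = read_clock_nanos(CLOCK_MONOTONIC) - start) < nanos) {
        /* Recompute the wall-clock deadline from the time still owed. */
        long long deadline = read_clock_nanos(CLOCK_REALTIME) + (nanos - elapsed);
        struct timespec abstime;
        abstime.tv_sec = (time_t)(deadline / NANOS_PER_SEC);
        abstime.tv_nsec = (long)(deadline % NANOS_PER_SEC);
        /*
         * This may return early (spurious wakeup, or ETIMEDOUT after the
         * wall clock was set forward). The result is ignored on purpose:
         * the loop condition decides whether enough time has passed.
         */
        (void)pthread_cond_timedwait(&cond, &mutex, &abstime);
    }
    pthread_mutex_unlock(&mutex);

    pthread_cond_destroy(&cond);
    pthread_mutex_destroy(&mutex);
    return elapsed;
}
```

Nothing in this loop shortens a wait whose CLOCK_REALTIME deadline has moved into the future because the clock was set back, which is the case reported in this thread.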
> > The fix is quite straight-forward, assuming the kernel does the right
> > thing - and that is to use pthread_cond_t associated with
> > CLOCK_MONOTONIC, or even better CLOCK_MONOTONIC_RAW.
> > But there is a complexity in the park code because that API allows
> > both relative and absolute timeouts and for the absolute case we would
> > have to use a different condition variable to wait on (one using
> > CLOCK_REALTIME as it should be affected by changes to the clock!).
> >
> Looks like an if() to me: it should use the old code when absolute and
> the fixed code when relative. Am I missing something?
There have to be two different condition variables associated with the
same mutex, that form the combined park/unpark implementation. If you do
a relative timed park you wait on one, if an absolute timed park then
you wait on the other - they each use different clocks. The unpark code
has to know which one to signal or redundantly signal both. Not terribly
hard just more complex than simply switching a clock.
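
A minimal sketch of that two-condition-variable arrangement, assuming POSIX pthread_condattr_setclock is available. The names parker_t, parker_init, parker_wait, parker_park_relative, parker_park_absolute and parker_unpark are hypothetical and exist only for this illustration; this is not the HotSpot park/unpark code. Here unpark takes the "redundantly signal both" option.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical parker: one mutex, one permit, two condition variables. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond_rel;  /* CLOCK_MONOTONIC: relative timeouts */
    pthread_cond_t  cond_abs;  /* CLOCK_REALTIME: absolute deadlines */
    bool            permit;
} parker_t;

static void parker_init(parker_t *p) {
    pthread_condattr_t attr;
    int status = pthread_condattr_init(&attr);
    assert(status == 0);
    status = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    assert(status == 0);
    status = pthread_cond_init(&p->cond_rel, &attr);
    assert(status == 0);
    pthread_condattr_destroy(&attr);
    status = pthread_cond_init(&p->cond_abs, NULL); /* default clock: CLOCK_REALTIME */
    assert(status == 0);
    status = pthread_mutex_init(&p->mutex, NULL);
    assert(status == 0);
    (void)status;
    p->permit = false;
}

/* Shared wait loop: true if a permit was consumed, false on timeout. */
static bool parker_wait(parker_t *p, pthread_cond_t *cond,
                        const struct timespec *deadline) {
    bool consumed;
    pthread_mutex_lock(&p->mutex);
    while (!p->permit) {
        if (pthread_cond_timedwait(cond, &p->mutex, deadline) == ETIMEDOUT) {
            break;
        }
    }
    consumed = p->permit;
    p->permit = false;
    pthread_mutex_unlock(&p->mutex);
    return consumed;
}

/* Relative park: deadline on CLOCK_MONOTONIC; assumes nanos >= 0. */
static bool parker_park_relative(parker_t *p, long long nanos) {
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += (time_t)(nanos / 1000000000LL);
    deadline.tv_nsec += (long)(nanos % 1000000000LL);
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return parker_wait(p, &p->cond_rel, &deadline);
}

/* Absolute park: the caller supplies a wall-clock (CLOCK_REALTIME) deadline. */
static bool parker_park_absolute(parker_t *p, const struct timespec *wall_deadline) {
    return parker_wait(p, &p->cond_abs, wall_deadline);
}

/* Unpark does not know which kind of wait is in progress, so it signals both. */
static void parker_unpark(parker_t *p) {
    pthread_mutex_lock(&p->mutex);
    p->permit = true;
    pthread_mutex_unlock(&p->mutex);
    pthread_cond_signal(&p->cond_rel);
    pthread_cond_signal(&p->cond_abs);
}
```

Signalling both condition variables costs one extra call per unpark; the alternative is to record under the mutex which variable the parked thread is waiting on and signal only that one.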
> > I can raise the priority of this but a fix for 8 may not be feasible
> > given the current state of things
> >
> That's really unfortunate. My guess is that if this thing goes public or
> viral Java will be in big trouble. With the "lens" of VP of Engineering
> in my company I am really considering alternatives.
>
>
> Do you have any workaround to suggest? Can you send me some code I/my
> team can try to patch the native libraries?
I started to prototype the fix for this years ago. I'll see if I can
revive the webrev. The basic change to use CLOCK_MONOTONIC is not hard.
Using CLOCK_MONOTONIC_RAW may be harder (we haven't switched to that yet
because our official build platforms haven't supported it).
I'll see what I can put up. But note this is not part of my day job.
Cheers,
David
> Cheers,
>
> Bruno
>
>
>
> On Mon, Sep 2, 2013 at 1:59 PM, David Holmes
> <david.holmes at oracle.com> wrote:
>
> Hi Bruno,
>
> As you note this is a very old issue. The reason it hasn't become a
> priority to fix was because it didn't actually manifest. In theory
> it should but in practice some "incorrect" clock handling in the
> kernel made everything work okay. Jump forward to now and we have
> already seen reports where this has become a problem on 64-bit but
> still works okay on 32-bit - which is very puzzling as in theory
> there should be no difference. My own thoughts are that something
> has been "fixed" in the 64-bit linux kernel and that this now
> exposes this issue where previously it did not.
>
> The basic sleep/wait/park with relative timeouts all use the same
> underlying mechanism on linux: pthread_cond_timedwait. This takes an
> absolute time which is currently based on CLOCK_REALTIME. So in
> theory if the clock is set forward the waits will complete earlier;
> and if set back they will complete later. But note this is not what
> was observed in practice.
>
> The fix is quite straight-forward, assuming the kernel does the
> right thing - and that is to use pthread_cond_t associated with
> CLOCK_MONOTONIC, or even better CLOCK_MONOTONIC_RAW.
>
> But there is a complexity in the park code because that API allows
> both relative and absolute timeouts and for the absolute case we
> would have to use a different condition variable to wait on (one
> using CLOCK_REALTIME as it should be affected by changes to the clock!).
>
> I can raise the priority of this but a fix for 8 may not be feasible
> given the current state of things.
>
> David Holmes
>
>
> On 2/09/2013 9:41 PM, bruno bossola wrote:
>
> Hi all,
>
> I am posting here after a few message exchanges on the LJC mailing
> list; this is from the 7u lead:
>
> ===================
> Looks like an old/known issue. I've seen varying reports around
> whether
> this is a linux kernel issue or jvm issue.
> I'd suggest that Bruno follows up with a question on the
> hotspot-runtime-dev at openjdk.java.net mailing list [...]
>
> ====================
>
> These days my teams are hitting a bug in the 64-bit JVM on 64-bit
> Linux: "...there is bug in JVM for overall scheduling during System
> time changes backward, which also impacts very basic Object.wait &
> Thread.sleep methods. It becomes too risky to keep Java App running
> when system time switches back by even certain seconds. You never know
> what your Java App will end up to." (source: stackoverflow.com)
>
>
> These are some of the consequences:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7139684
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6311057
>
> The original bug is private, but I was told it is a P4 that
> unfortunately is not being looked after and simply gets shifted from
> one release to the next:
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6900441
>
> See also here for a stackoverflow drill:
> http://stackoverflow.com/questions/9044423/java-scheduler-which-is-completely-independent-of-system-time-changes
>
> This bug is NOT fixed in the latest JVM, so the recommended course of
> action is to restart the VM if a big time jump happens (on small jumps
> the JVM will catch up). This happens consistently on a 64-bit VM when
> used on a 64-bit Linux system, regardless of the monotonicity of the
> underlying OS (at least apparently).
>
> Note that this should not happen for primitives such as
> System.nanoTime() (like the queue used internally for
> ScheduledExecutor)
> that should work correctly in presence of a monotonic system:
>
>     jlong os::javaTimeNanos() {
>       if (Linux::supports_monotonic_clock()) {
>         struct timespec tp;
>         int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
>         assert(status == 0, "gettime error");
>         jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) +
>                        jlong(tp.tv_nsec);
>         return result;
>       } else {
>         timeval time;
>         int status = gettimeofday(&time, NULL);
>         assert(status != -1, "linux error");
>         jlong usecs = jlong(time.tv_sec) * (1000 * 1000) +
>                       jlong(time.tv_usec);
>         return 1000 * usecs;
>       }
>     }
>
> Unfortunately, for some reason, this is not the case on a 1.6+ 64-bit
> VM on 64-bit Linux. To make the issue, its extent, and its effect on
> the concurrency library clearer, here is a very simple program:
>
>     import java.util.concurrent.locks.LockSupport;
>
>     public class Main {
>
>         public static void main(String[] args) {
>
>             for (int i=100; i>0; i--) {
>                 System.out.println(i);
>                 LockSupport.parkNanos(1000L*1000L*1000L);
>             }
>
>             System.out.println("Done!");
>         }
>     }
>
> While running it with a 64-bit 1.6+ JVM on 64-bit Linux, turn the
> clock back one hour and wait until the counter stops... magic! I
> tested this on JDK6, JDK7 and the latest JDK8 beta running on various
> Ubuntu distros. It's not just a matter of the (old?) sleep() and wait()
> primitives, it also affects the new concurrency library. Please note
> that classic sleep() works correctly on JDK1.4: that qualifies this
> bug as a regression to me, and the fact that it has been there for at
> least 7 years kind of troubles me.
>
> This is something we cannot easily manage as our software is installed
> on our customers' premises, hence we have no control at all over time
> changes: if our application hangs, we are pretty much in big trouble.
>
> I'd really like to get your view on the matter.
>
> Thanks in advance,
>
> Bruno
>
>
>