JVM64bit running on Linux 64bit: when system time changes the JVM may hang (bug_id=6900441)

David Holmes david.holmes at oracle.com
Mon Sep 2 15:51:44 PDT 2013


BTW this is all summed up in my eval of 6311057 from 2009:

http://bugs.sun.com/view_bug.do?bug_id=6311057

David

On 3/09/2013 8:13 AM, David Holmes wrote:
> On 3/09/2013 12:02 AM, bruno bossola wrote:
>> Hi David,
>>
>> thanks for your answer, it clarifies the matter.
>>
>>
>>  > My own thoughts are that something has been "fixed" in the 64-bit linux kernel
>>  > and that this now exposes this issue where previously it did not.
>>  >
>> If I recall the matter correctly the Linux kernel was not always
>> behaving and for that reason the JVM was double-checking outside using
>> timeofday. This "fix" is now affecting the correct behaviour.
>
> No I don't think so. There have been a lot of bugs in this area. One
> issue we tried to address was the "early return from sleep/wait", by
> checking actual elapsed time rather than trusting the timed routines.
> But that only dealt with forward time jumps.
>
> As I said in other email I don't yet know what exactly has changed in
> the 64-bit kernel to expose this issue.
>
>>  > The fix is quite straight-forward, assuming the kernel does the right thing - and
>>  > that is to use pthread_cond_t associated with CLOCK_MONOTONIC, or even
>>  > better CLOCK_MONOTONIC_RAW.
>>  > But there is a complexity in the park code because that API allows both relative
>>  > and absolute timeouts and for the absolute case we would have to use a different
>>  > condition variable to wait on (one using CLOCK_REALTIME as it should be affected
>>  > by changes to the clock!).
>>  >
>> Looks like an if() to me: it should use the old code when absolute and
>> the fixed code when relative. Am I missing something?
>
> There have to be two different condition variables associated with the
> same mutex, that form the combined park/unpark implementation. If you do
> a relative timed park you wait on one, if an absolute timed park then
> you wait on the other - they each use different clocks. The unpark code
> has to know which one to signal or redundantly signal both. Not terribly
> hard just more complex than simply switching a clock.
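
A minimal sketch of the two-condition-variable arrangement described above, written in plain C rather than HotSpot's C++. The names parker_t, parker_init, parker_wait, parker_park_nanos, parker_park_until and parker_unpark are hypothetical and not taken from the HotSpot sources. It assumes a platform where pthread_condattr_setclock accepts CLOCK_MONOTONIC, and it takes the simpler of the two options mentioned: unpark redundantly signals both condition variables.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical parker: one mutex shared by two condition variables
 * that measure their timeouts against different clocks. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond_rel;  /* CLOCK_MONOTONIC: relative timeouts  */
    pthread_cond_t  cond_abs;  /* CLOCK_REALTIME: absolute deadlines  */
    bool            permit;
} parker_t;

static void parker_init(parker_t *p) {
    pthread_condattr_t attr;
    int status = pthread_condattr_init(&attr);
    assert(status == 0);
    status = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    assert(status == 0);
    status = pthread_mutex_init(&p->mutex, NULL);
    assert(status == 0);
    status = pthread_cond_init(&p->cond_rel, &attr);
    assert(status == 0);
    /* Default attributes: the condition variable uses CLOCK_REALTIME. */
    status = pthread_cond_init(&p->cond_abs, NULL);
    assert(status == 0);
    pthread_condattr_destroy(&attr);
    (void)status;
    p->permit = false;
}

/* Wait on the given condition variable until a permit arrives or the
 * deadline passes. Returns true if a permit was consumed. */
static bool parker_wait(parker_t *p, pthread_cond_t *cond,
                        const struct timespec *deadline) {
    pthread_mutex_lock(&p->mutex);
    int rc = 0;
    while (!p->permit && rc == 0) {
        rc = pthread_cond_timedwait(cond, &p->mutex, deadline);
    }
    bool got = p->permit;
    p->permit = false;
    pthread_mutex_unlock(&p->mutex);
    return got;
}

/* Relative park: deadline is computed on the monotonic clock, so
 * setting the wall clock forward or back does not move it. */
static bool parker_park_nanos(parker_t *p, int64_t rel_nanos) {
    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec  += (time_t)(rel_nanos / 1000000000LL);
    deadline.tv_nsec += (long)(rel_nanos % 1000000000LL);
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return parker_wait(p, &p->cond_rel, &deadline);
}

/* Absolute park: the deadline is a wall-clock time, so it should be
 * affected by changes to the clock. */
static bool parker_park_until(parker_t *p, const struct timespec *wall_deadline) {
    return parker_wait(p, &p->cond_abs, wall_deadline);
}

/* Unpark does not know which variable the waiter chose, so signal both. */
static void parker_unpark(parker_t *p) {
    pthread_mutex_lock(&p->mutex);
    p->permit = true;
    pthread_cond_signal(&p->cond_rel);
    pthread_cond_signal(&p->cond_abs);
    pthread_mutex_unlock(&p->mutex);
}
```

Tracking which variable the parked thread is waiting on would avoid the redundant signal, at the cost of one more field protected by the mutex.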
>
>>  > I can raise the priority of this but a fix for 8 may not be feasible given the
>>  > current state of things
>>  >
>> That's really unfortunate. My guess is that if this thing goes public or
>> viral Java will be in big trouble. With the "lens" of VP of Engineering
>> in my company I am really considering alternatives.
>>
>>
>> Do you have any workaround to suggest? Can you send me some code I/my
>> team can try to patch the native libraries?
>
> I started to prototype the fix for this years ago. I'll see if I can
> revive the webrev. The basic change to use CLOCK_MONOTONIC is not hard.
> Using CLOCK_MONOTONIC_RAW may be harder (we haven't switched to that yet
> because our official build platforms haven't supported it).
>
> I'll see what I can put up. But note this is not part of my day job.
>
> Cheers,
> David
>
>
>> Cheers,
>>
>>      Bruno
>>
>>
>>
>> On Mon, Sep 2, 2013 at 1:59 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>>     Hi Bruno,
>>
>>     As you note this is a very old issue. The reason it hasn't become a
>>     priority to fix was because it didn't actually manifest. In theory
>>     it should but in practice some "incorrect" clock handling in the
>>     kernel made everything work okay. Jump forward to now and we have
>>     already seen reports where this has become a problem on 64-bit but
>>     still works okay on 32-bit - which is very puzzling as in theory
>>     there should be no difference. My own thoughts are that something
>>     has been "fixed" in the 64-bit linux kernel and that this now
>>     exposes this issue where previously it did not.
>>
>>     The basic sleep/wait/park with relative timeouts all use the same
>>     underlying mechanism on linux: pthread_cond_timedwait. This takes an
>>     absolute time which is currently based on CLOCK_REALTIME. So in
>>     theory if the clock is set forward the waits will complete earlier;
>>     and if set back they will complete later. But note this is not what
>>     was observed in practice.
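
To make the mechanism above concrete, here is a small sketch of how a relative timeout becomes the absolute deadline that pthread_cond_timedwait expects. The helper names add_nanos and deadline_after are hypothetical. The clock passed to deadline_after has to match the clock of the condition variable being waited on: with CLOCK_REALTIME the deadline moves relative to the waiter whenever the system time is set, which is the behaviour described above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NANOS_PER_SEC 1000000000L

/* Hypothetical helper: add a non-negative relative timeout to a clock
 * reading, keeping tv_nsec normalised to [0, NANOS_PER_SEC). */
static struct timespec add_nanos(struct timespec now, int64_t rel_nanos) {
    assert(rel_nanos >= 0);
    struct timespec abs_time;
    abs_time.tv_sec  = now.tv_sec + (time_t)(rel_nanos / NANOS_PER_SEC);
    abs_time.tv_nsec = now.tv_nsec + (long)(rel_nanos % NANOS_PER_SEC);
    if (abs_time.tv_nsec >= NANOS_PER_SEC) {
        abs_time.tv_sec  += 1;
        abs_time.tv_nsec -= NANOS_PER_SEC;
    }
    return abs_time;
}

/* Hypothetical helper: compute "now + rel_nanos" on the given clock.
 * Returns 0 on success, -1 if the clock could not be read. */
static int deadline_after(clockid_t clock, int64_t rel_nanos,
                          struct timespec *out) {
    struct timespec now;
    if (clock_gettime(clock, &now) != 0) {
        return -1;
    }
    *out = add_nanos(now, rel_nanos);
    return 0;
}
```

A deadline from deadline_after(CLOCK_REALTIME, ...) is a wall-clock instant: set the clock back one hour and a one-second wait becomes roughly an hour and one second. A deadline computed on CLOCK_MONOTONIC does not shift when the system time is set.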
>>
>>     The fix is quite straight-forward, assuming the kernel does the
>>     right thing - and that is to use pthread_cond_t associated with
>>     CLOCK_MONOTONIC, or even better CLOCK_MONOTONIC_RAW.
>>
>>     But there is a complexity in the park code because that API allows
>>     both relative and absolute timeouts and for the absolute case we
>>     would have to use a different condition variable to wait on (one
>>     using CLOCK_REALTIME as it should be affected by changes to the
>>     clock!).
>>
>>     I can raise the priority of this but a fix for 8 may not be feasible
>>     given the current state of things.
>>
>>     David Holmes
>>
>>
>>     On 2/09/2013 9:41 PM, bruno bossola wrote:
>>
>>         Hi all,
>>
>>         I am posting here after a few message exchanges on the LJC mailing
>>         list, on the advice of the 7u lead:
>>
>>         ===================
>>         Looks like an old/known issue. I've seen varying reports around
>>         whether this is a linux kernel issue or jvm issue.
>>         I'd suggest that Bruno follows up with a question on the
>>         hotspot-runtime-dev at openjdk.java.net mailing list [...]
>>
>>         ====================
>>
>>         These days my teams are hitting a bug on the 64-bit JVM on 64-bit
>>         Linux: "...there is bug in JVM for overall scheduling during
>>         System time changes backward, which also impacts very basic
>>         Object.wait & Thread.sleep methods. It becomes too risky to keep
>>         Java App running when system time switches back by even certain
>>         seconds. You never know what your Java App will end up to."
>>         (source: stackoverflow.com)
>>
>>
>>         These are some of the consequences:
>>         http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7139684
>>         http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6311057
>>
>>         The original bug is private, but I was told it's a P4 that
>>         unfortunately is not looked after and simply gets shifted from
>>         one release to the next:
>>         http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6900441
>>
>>         See also here for a stackoverflow drill:
>>
>>         http://stackoverflow.com/questions/9044423/java-scheduler-which-is-completely-independent-of-system-time-changes
>>
>>
>>         This bug is NOT fixed in the latest JVM, so the recommended
>>         course of action is to restart the VM if a big time jump happens
>>         (on small jumps the JVM will catch up). This happens consistently
>>         on a 64-bit VM running on a 64-bit Linux system, regardless of the
>>         monotonicity of the underlying OS (at least apparently).
>>
>>         Note that this should not happen for primitives such as
>>         System.nanoTime() (like the queue used internally for
>>         ScheduledExecutor)
>>         that should work correctly in presence of a monotonic system:
>>
>>         jlong os::javaTimeNanos() {
>>             if (Linux::supports_monotonic_clock()) {
>>               struct timespec tp;
>>               int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
>>               assert(status == 0, "gettime error");
>>               jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) + jlong(tp.tv_nsec);
>>               return result;
>>             } else {
>>               timeval time;
>>               int status = gettimeofday(&time, NULL);
>>               assert(status != -1, "linux error");
>>               jlong usecs = jlong(time.tv_sec) * (1000 * 1000) + jlong(time.tv_usec);
>>               return 1000 * usecs;
>>             }
>>         }
>>
>>         Unfortunately, for some reason, this is not the case on a 1.6+
>>         64-bit VM on 64-bit Linux. Furthermore, to make the issue, its
>>         extent, and its effect on the concurrency library clearer, let me
>>         introduce this very simple program:
>>
>>         import java.util.concurrent.locks.LockSupport;
>>
>>         public class Main {
>>
>>               public static void main(String[] args) {
>>
>>                   for (int i=100; i>0; i--) {
>>                       System.out.println(i);
>>                       LockSupport.parkNanos(1000L*1000L*1000L);
>>                   }
>>
>>                   System.out.println("Done!");
>>               }
>>         }
>>
>>         While running it with a 64-bit 1.6+ JVM on 64-bit Linux, turn the
>>         clock back one hour and wait until the counter stops... magic! I
>>         tested this on JDK6, JDK7 and the latest JDK8 beta running on
>>         various Ubuntu distros. It's not just a matter of the (old?)
>>         sleep() and wait() primitives, it also affects the new concurrency
>>         library. Please note that classic sleep() works correctly on
>>         JDK1.4: that qualifies this bug as a regression to me, and the
>>         fact that it has been there for at least 7 years troubles me.
>>
>>         This is something we cannot easily manage as our software is
>>         installed on-premises at our customers' sites, hence we have no
>>         control at all over time changes: if our application hangs, we
>>         are in big trouble.
>>
>>         I'd really like to get your view on the matter.
>>
>>         Thanks in advance,
>>
>>               Bruno
>>
>>
>>


More information about the hotspot-runtime-dev mailing list