JVM64bit running on Linux 64bit: when system time changes the JVM may hang (bug_id=6900441)
David Holmes
david.holmes at oracle.com
Mon Sep 2 15:51:44 PDT 2013
BTW this is all summed up in my eval of 6311057 from 2009:
http://bugs.sun.com/view_bug.do?bug_id=6311057
David
On 3/09/2013 8:13 AM, David Holmes wrote:
> On 3/09/2013 12:02 AM, bruno bossola wrote:
>> Hi David,
>>
>> thanks for your answer, it clarifies the matter.
>>
>>
>> > My own thoughts are that something has been "fixed" in the 64-bit
>> > linux kernel and that this now exposes this issue where previously
>> > it did not.
>> >
>> If I recall the matter correctly the Linux kernel was not always
>> behaving and for that reason the JVM was double-checking outside using
>> timeofday. This "fix" is now affecting the correct behaviour
>
> No I don't think so. There have been a lot of bugs in this area. One
> issue we tried to address was the "early return from sleep/wait", by
> checking actual elapsed time rather than trusting the timed routines.
> But that only dealt with forward time jumps.
>
> As I said in other email I don't yet know what exactly has changed in
> the 64-bit kernel to expose this issue.
>
>> > The fix is quite straight-forward, assuming the kernel does the right
>> > thing - and that is to use pthread_cond_t associated with
>> > CLOCK_MONOTONIC, or even better CLOCK_MONOTONIC_RAW.
>> > But there is a complexity in the park code because that API allows
>> > both relative and absolute timeouts and for the absolute case we would
>> > have to use a different condition variable to wait on (one using
>> > CLOCK_REALTIME as it should be affected by changes to the clock!).
>> >
>> Looks like an if() to me: it should use the old code when absolute and the
>> fixed code when relative. Am I missing something?
>
> There have to be two different condition variables associated with the
> same mutex, that form the combined park/unpark implementation. If you do
> a relative timed park you wait on one, if an absolute timed park then
> you wait on the other - they each use different clocks. The unpark code
> has to know which one to signal or redundantly signal both. Not terribly
> hard just more complex than simply switching a clock.
>
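A minimal sketch of the two-condition-variable arrangement described above, assuming POSIX threads. This is not the HotSpot code: `parker_t`, `parker_init`, `parker_park` and `parker_unpark` are hypothetical names, and the argument convention (nanoseconds for a relative park, milliseconds since the epoch for an absolute park) is an assumption made for illustration. The unpark side takes the "redundantly signal both" option.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NANOS_PER_SEC 1000000000L

/* Hypothetical parker: one mutex, two condition variables on different clocks. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  rel_cond;  /* relative timeouts, CLOCK_MONOTONIC */
    pthread_cond_t  abs_cond;  /* absolute timeouts, CLOCK_REALTIME  */
    bool            permit;
} parker_t;

static int parker_init(parker_t *p)
{
    pthread_condattr_t attr;
    int rc = pthread_condattr_init(&attr);
    if (rc != 0)
        return rc;
    /* Bind the relative-wait condvar to the monotonic clock. */
    rc = pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    if (rc == 0)
        rc = pthread_cond_init(&p->rel_cond, &attr);
    pthread_condattr_destroy(&attr);
    if (rc != 0)
        return rc;
    /* Default attributes: the condvar uses CLOCK_REALTIME. */
    rc = pthread_cond_init(&p->abs_cond, NULL);
    if (rc != 0)
        return rc;
    p->permit = false;
    return pthread_mutex_init(&p->mutex, NULL);
}

/* Relative park: t is nanoseconds from now (t >= 0).
 * Absolute park: t is milliseconds since the epoch.
 * Returns true if a permit was consumed, false on timeout or error. */
static bool parker_park(parker_t *p, bool is_absolute, int64_t t)
{
    struct timespec deadline;
    pthread_cond_t *cond;

    assert(t >= 0);
    if (is_absolute) {
        cond = &p->abs_cond;
        deadline.tv_sec  = (time_t)(t / 1000);
        deadline.tv_nsec = (long)(t % 1000) * 1000000L;
    } else {
        cond = &p->rel_cond;
        clock_gettime(CLOCK_MONOTONIC, &deadline);
        deadline.tv_sec  += (time_t)(t / NANOS_PER_SEC);
        deadline.tv_nsec += (long)(t % NANOS_PER_SEC);
        if (deadline.tv_nsec >= NANOS_PER_SEC) {
            deadline.tv_sec  += 1;
            deadline.tv_nsec -= NANOS_PER_SEC;
        }
    }

    pthread_mutex_lock(&p->mutex);
    int rc = 0;
    /* Loop to absorb spurious wakeups; stop on ETIMEDOUT or any other error. */
    while (!p->permit && rc == 0)
        rc = pthread_cond_timedwait(cond, &p->mutex, &deadline);
    bool got = p->permit;
    p->permit = false;
    pthread_mutex_unlock(&p->mutex);
    return got;
}

static void parker_unpark(parker_t *p)
{
    pthread_mutex_lock(&p->mutex);
    p->permit = true;
    /* The waiter could be on either condvar, so signal both. */
    pthread_cond_signal(&p->rel_cond);
    pthread_cond_signal(&p->abs_cond);
    pthread_mutex_unlock(&p->mutex);
}
```

Signalling both condition variables keeps unpark simple at the cost of one redundant call. Tracking which one the parked thread is waiting on would avoid that call but adds state that has to be kept consistent under the mutex.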
>> > I can raise the priority of this but a fix for 8 may not be feasible
>> > given the current state of things
>> >
>> That's really unfortunate. My guess is that if this thing goes public or
>> viral Java will be in big trouble. With the "lens" of VP of Engineering
>> in my company I am really considering alternatives.
>>
>>
>> Do you have any workaround to suggest? Can you send me some code I/my
>> team can try to patch the native libraries?
>
> I started to prototype the fix for this years ago. I'll see if I can
> revive the webrev. The basic change to use CLOCK_MONOTONIC is not hard.
> Using CLOCK_MONOTONIC_RAW may be harder (we haven't switched to that yet
> because our official build platforms haven't supported it).
>
> I'll see what I can put up. But note this is not part of my day job.
>
> Cheers,
> David
>
>
>> Cheers,
>>
>> Bruno
>>
>>
>>
>> On Mon, Sep 2, 2013 at 1:59 PM, David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Bruno,
>>
>> As you note this is a very old issue. The reason it hasn't become a
>> priority to fix was because it didn't actually manifest. In theory
>> it should but in practice some "incorrect" clock handling in the
>> kernel made everything work okay. Jump forward to now and we have
>> already seen reports where this has become a problem on 64-bit but
>> still works okay on 32-bit - which is very puzzling as in theory
>> there should be no difference. My own thoughts are that something
>> has been "fixed" in the 64-bit linux kernel and that this now
>> exposes this issue where previously it did not.
>>
>> The basic sleep/wait/park with relative timeouts all use the same
>> underlying mechanism on linux: pthread_cond_timedwait. This takes an
>> absolute time which is currently based on CLOCK_REALTIME. So in
>> theory if the clock is set forward the waits will complete earlier;
>> and if set back they will complete later. But note this is not what
>> was observed in practice.
>>
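To illustrate the mechanism described above: a relative timeout has to be converted into an absolute deadline before it is handed to pthread_cond_timedwait, and the clock that "now" is read from determines whether later clock changes shift the wait. The helper below is a sketch only; `deadline_after` is a hypothetical name, not a HotSpot function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NANOS_PER_SEC 1000000000L

/* Hypothetical helper: compute the absolute time `rel_nanos` from now,
 * measured on `clock`. Returns 0 on success, -1 if the clock is unavailable. */
static int deadline_after(clockid_t clock, int64_t rel_nanos, struct timespec *out)
{
    struct timespec now;

    assert(rel_nanos >= 0);
    if (clock_gettime(clock, &now) != 0)
        return -1;

    out->tv_sec  = now.tv_sec + (time_t)(rel_nanos / NANOS_PER_SEC);
    out->tv_nsec = now.tv_nsec + (long)(rel_nanos % NANOS_PER_SEC);
    /* Both addends are below one second, so a single carry is enough. */
    if (out->tv_nsec >= NANOS_PER_SEC) {
        out->tv_sec  += 1;
        out->tv_nsec -= NANOS_PER_SEC;
    }
    return 0;
}
```

With CLOCK_REALTIME, setting the system clock back after the deadline is computed pushes the wake-up further into the future. With CLOCK_MONOTONIC the deadline is unaffected, but the condition variable must also have been created with that clock (via pthread_condattr_setclock) for pthread_cond_timedwait to interpret the deadline against it.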
>> The fix is quite straight-forward, assuming the kernel does the
>> right thing - and that is to use pthread_cond_t associated with
>> CLOCK_MONOTONIC, or even better CLOCK_MONOTONIC_RAW.
>>
>> But there is a complexity in the park code because that API allows
>> both relative and absolute timeouts and for the absolute case we
>> would have to use a different condition variable to wait on (one
>> using CLOCK_REALTIME as it should be affected by changes to the
>> clock!).
>>
>> I can raise the priority of this but a fix for 8 may not be feasible
>> given the current state of things.
>>
>> David Holmes
>>
>>
>> On 2/09/2013 9:41 PM, bruno bossola wrote:
>>
>> Hi all,
>>
>> I am posting here after a few messages exchanged on the LJC mailing
>> list. This is from the 7u lead:
>>
>> ===================
>> Looks like an old/known issue. I've seen varying reports around whether
>> this is a linux kernel issue or jvm issue.
>> I'd suggest that Bruno follows up with a question on the
>> hotspot-runtime-dev at openjdk.java.net mailing list [...]
>>
>> ====================
>>
>> These days my teams are hitting a bug on the JVM 64bit on Linux
>> 64bit: "...there is bug in JVM for overall scheduling during System time
>> changes backward, which also impacts very basic Object.wait &
>> Thread.sleep methods. It becomes too risky to keep Java App running when
>> system time switches back by even certain seconds. You never know what
>> your Java App will end up to." (source: stackoverflow.com)
>>
>>
>> These are some of the consequences:
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7139684
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6311057
>>
>> The original bug is private, but I was told it's a P4 that unfortunately
>> is not looked after and simply gets shifted from one release to the next:
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6900441
>>
>> See also here for a stackoverflow drill:
>>
>> http://stackoverflow.com/questions/9044423/java-scheduler-which-is-completely-independent-of-system-time-changes
>>
>>
>> This bug is NOT fixed in the latest JVM, so the recommended course of
>> action is to restart the VM if a big time jump happens (on small jumps
>> the JVM will catch up). This happens consistently on a 64bit VM running
>> on a 64bit linux system, regardless of the monotonicity of the
>> underlying OS (at least apparently).
>>
>> Note that this should not happen for primitives such as
>> System.nanoTime() (used, for example, by the queue inside
>> ScheduledExecutor), which should work correctly in the presence of a
>> monotonic system:
>>
>>     jlong os::javaTimeNanos() {
>>       if (Linux::supports_monotonic_clock()) {
>>         struct timespec tp;
>>         int status = Linux::clock_gettime(CLOCK_MONOTONIC, &tp);
>>         assert(status == 0, "gettime error");
>>         jlong result = jlong(tp.tv_sec) * (1000 * 1000 * 1000) +
>>                        jlong(tp.tv_nsec);
>>         return result;
>>       } else {
>>         timeval time;
>>         int status = gettimeofday(&time, NULL);
>>         assert(status != -1, "linux error");
>>         jlong usecs = jlong(time.tv_sec) * (1000 * 1000) +
>>                       jlong(time.tv_usec);
>>         return 1000 * usecs;
>>       }
>>     }
>>
>> Unfortunately, for some reason, this is not the case on a 1.6+ 64bit VM
>> on 64bit Linux. Furthermore, to be clearer about the issue, its extent,
>> and its effect on the concurrency library, let me introduce this very
>> simple program:
>>
>>     import java.util.concurrent.locks.LockSupport;
>>
>>     public class Main {
>>
>>         public static void main(String[] args) {
>>
>>             for (int i = 100; i > 0; i--) {
>>                 System.out.println(i);
>>                 LockSupport.parkNanos(1000L * 1000L * 1000L);
>>             }
>>
>>             System.out.println("Done!");
>>         }
>>     }
>>
>> While running it with a 64bit 1.6+ JVM on 64bit Linux, turn the clock
>> back one hour and wait until the counter stops... magic! I tested this
>> on JDK6, JDK7 and the latest JDK8 beta running on various Ubuntu distros.
>> It's not just a matter of the (old?) sleep() and wait() primitives, it
>> also affects the new concurrency library. Please note that classic
>> sleep() works correctly on JDK1.4: this qualifies the bug as a regression
>> to me, and the fact that it has been there for at least 7 years kind of
>> troubles me.
>>
>> This is something we cannot easily manage as our software is installed
>> on-premises at our customers, hence we have no control at all over time
>> changes: if our application hangs, we are pretty much in big trouble.
>>
>> I'd really like to get your view on the matter.
>>
>> Thanks in advance,
>>
>> Bruno
>>
>>
>>