Leap Second Kernel issue (6/30/2012) and JDK-6900441

Tue Apr 21 01:50:53 UTC 2015

Hi Andrew,

On 21/04/2015 12:48 AM, Tolbert, Andrew wrote:
> Hi David,
>
> Thank you very much for your response.   My apologies on the confusion
> with the issue numbers and not providing adequate information.  You
> are correct in that I mistakenly mention 6311057.  I did indeed mean
> JDK-6900441 (https://bugs.openjdk.java.net/browse/JDK-6900441).   I
> must have made this mistake when referencing issues, as I was looking
> at 6311057 as 6900441 was mentioned as a continuation of it, and when
> I went to write up this email I must have mistaken the issues for one
> another.
>
>> The fix for 6900441 finally made the change to using CLOCK_MONOTONIC for the pthread_cond_t variables for relative waits (which makes them almost immune to changes in the TOD).
>
> This is what I believe resolves the issue.  The kernel bug I am
> referring to (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285)
> mentions the following:
>
>> Software that relies on fine-grained pthread timeouts will spin indefinitely and drive up system load following a leap second, when the kernel's idea of time has become desynced and sub-1s timeouts are all hit immediately.
>
> A good explanation can be described here: https://lwn.net/Articles/504744/:
>
>> ... The hrtimer offsets allow the timer subsystem to quickly turn a system time into a time value appropriate for a specific processor's realtime clock.
>>
>> If the system time changes, those offsets must be adjusted accordingly. There is a function called clock_was_set() that handles this task. As long as any system time change is followed by a call to clock_was_set(), all will be well. The problem, naturally, is that the kernel failed to call clock_was_set() after the leap second adjustment, which certainly qualifies as a system time change. So the hrtimer subsystem's idea of the current time moved forward while the system time was held back for a second; hrtimers were thereafter operating one second in the future. The result of that offset is that timers started expiring one second sooner than they should have; that is not quite what the timer developers had in mind when they used the term "high resolution."
>>
>> For many applications, having a timer go off one second early is not a big problem. But there are plenty of situations where timers are set for less than one second in the future; all such timers will naturally expire immediately if the timer subsystem is operating one second ahead of the system time. Many of these timers are also recurring timers; they will be re-set immediately after expiration, at which point they will immediately expire again — and so on. The resulting loop is the source of the load spikes reported by victims of this bug across the net.
>
> My thoughts are that in the fix for 6900441 to switch to
> CLOCK_MONOTONIC for relative waits makes java no longer susceptible to
> this kernel issue (at least in this particular case), since the
> situation where the hrtimer offset is not adjusted but the system time
> is does not effect CLOCK_MONOTONIC since it is not adjusted during a
> leap second.  Does that assumption sound correct?

Sounds reasonable. I was working on the assumption that after the 
initial early return, if we called pthread_condtimedwait again it would 
have new timeout value and so would not be impacted by the TOD change 
that had happened earlier. But based on the above the hrtimer's notion 
of "now" will still be wrong and will cause the wait to again return 
immediately.

>> The fix for 8037340 reverted the code back to calculating the time-to-wait based on CLOCK_REALTIME as that is the only clock the platform/OS "semaphore" uses. This has been inadvertently changed by the fix for 6900441. There's only one use of that Semaphore the sr_semaphore used for a simple suspend/resume mechanism. It does a 2ns timed-wait. This is susceptible to changes in the TOD: if time goes forward we will return prematurely; if time goes back, we will wait longer. But for small changes, like a second, the effect would be benign.
>
> With the kernel issue I just described, it seems to me that the
> timed-wait will return prematurely in all cases until the timers are
> corrected (either by setting the date explicitly or resetting the
> system).   Looking at how the timedwait function is used in
> os_linux.cpp, in do_suspend if timedwait returns false, it will
> simpily do a sr_semaphore.wait().  However, in do_resume it will
> repeatedly do a timedwait until it returns true.  If the timer offsets
> get off by 1 second in the future with the system clock, wouldn't this
> introduce a problem?  Any way I could try to manifest an issue around
> there?

Yes. Again I was assuming subsequent calls would cause the correct 
current time to be used.

Cheers,
David

> Thanks,
> Andy
>
> On Sun, Apr 19, 2015 at 9:59 PM, David Holmes <david.holmes at oracle.com> wrote:
>> Hi Andrew,
>>
>> I've attempted to answer this email a few times now but eventually realized
>> there is not enough information to really understand what you have observed.
>> You refer to the fix for 6311057
>>
>> https://bugs.openjdk.java.net/browse/JDK-6311057
>>
>> as being responsible for use of CLOCK_MONOTONIC, but it wasn't - that was
>> 6900441 as per your subject line. So I'm unclear when you really mean
>> 6311057 and when you mean 6900441 - but 6311057 was fixed way back in early
>> JDK 6 so would never be seen in 7 or later.
>>
>> There are three parts to this issue:
>> - how the OS/kernel handles changes to the TOD for pthread_condtimedwait
>> with CLOCK_REALTIME
>> - how the "time" read by javaTimeMillis/javaTimeNanos was affected by
>> changes to the TOD
>> - how the VM logic responded to early returns from the OS/kernel
>> pthread_condtimedwait (ie whether it read the elapsed time - and how - and
>> whether it calculated an updated time-to-wait).
>>
>> Only some timed-waiting APIs (eg Thread.sleep) prevent early returns; for
>> others (eg. Object.wait, ParkSupport.parkNanos) it is just considered a
>> "spurious wakeup" which is permitted by the spec.
>>
>> There's no simple story to explain all the possibilities as I certainly
>> can't tell you exactly which kernel behaved in what way. I tried to
>> summarise the situation in my 2009/11/11 comment in 6311057. The fix for
>> 6900441 finally made the change to using CLOCK_MONOTONIC for the
>> pthread_cond_t variables for relative waits (which makes them almost immune
>> to changes in the TOD).
>>
>> The fix for 8037340 reverted the code back to calculating the time-to-wait
>> based on CLOCK_REALTIME as that is the only clock the platform/OS
>> "semaphore" uses. This has been inadvertently changed by the fix for
>> 6900441. There's only one use of that Semaphore the sr_semaphore used for a
>> simple suspend/resume mechanism. It does a 2ns timed-wait. This is
>> susceptible to changes in the TOD: if time goes forward we will return
>> prematurely; if time goes back, we will wait longer. But for small changes,
>> like a second, the effect would be benign.
>>
>> For systems with 6311057 fixed I'm at a loss to explain the high CPU
>> utilization issue. Even if the TOD changes forward causing early returns
>> from the timed-waits, the logic in most cases would re-try the wait with a
>> re-calculated time and the second wait would not be affected by the earlier
>> TOD change. If the TOD changes backwards, as with a leap-second, then
>> potentially timed-waits will be a second longer (assuming we're not using
>> CLOCK_MONOTONIC and depending on kernel behaviour) but that would not cause
>> CPU utilization spikes - quite the opposite (unless you observe threads
>> going into busy-wait loops polling the threads that are now delayed longer
>> than expected ?).
>>
>> Cheers,
>> David
>>
>>
>> On 18/04/2015 1:57 AM, Tolbert, Andrew wrote:
>>>
>>> DISCLAIMER: I fully acknowledge that this is not a jvm issue and I
>>> know well that the real solution to this problem comes in the form a
>>> kernel-level fix, but I found this interesting and was hoping to get
>>> some validation of my thoughts.   I also do not have much familiarity
>>> with hotspot or JVM internals, so I apologize if I've said anything
>>> wrong here or am not following the correct procedures.
>>>
>>> I've been doing some investigation pending the leap second that will
>>> be added on June 30, 2015 at 23:59:59 UTC
>>> (https://hpiers.obspm.fr/eoppc/bul/bulc/bulletinc.49).  Back on June
>>> 30, 2012 a bunch of services were impacted by the notorious leap
>>> second bug in the linux kernel
>>> (https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1020285).   Some
>>> of these applications were java-based (cassandra, hadoop,
>>> elasticsearch, etc.).
>>>
>>> To get a better understanding of the issue, I ran a test on an old
>>> ubuntu install running 10.10 and kernel version 2.6.35-32.    I set up
>>> a test with a completely idle cassandra 1.2.19 instance using jdk7u75
>>> and used John Stultz's leap-a-day
>>> (https://github.com/johnstultz-work/timetests/blob/master/leap-a-day.c)
>>> program to set the time near midnight and inject a leap second at
>>> 23:59:59.  What I found surprised me a little bit,  I was not able to
>>> manifest the issue that I had previously experienced in 2012!
>>>
>>> That got me thinking, maybe there was some particular change in the
>>> JVM to accommodate for this issue?  I went back to JDK6u32 (version I
>>> believe I was running back in 2012) and reran my test, and my cpu
>>> utilization pegged out at exactly midnight!
>>>
>>> I then tried upgrading to the latest JDK6 version (u45) and also could
>>> reproduce the issue.  I tried an early arbitrary version of JDK7 and
>>> the issue also manifested.  I then went through my test with various
>>> JDK updates and found that the earliest release where I could not
>>> reproduce the issue was 7u60.
>>>
>>> Looking through the bug fix list for u60
>>>
>>> (http://www.oracle.com/technetwork/java/javase/2col/7u60-bugfixes-2202029.html),
>>> I found two fixes around time that seemed interesting:
>>>
>>> - JDK-6311057: Java Thread.sleep(timeout) is influenced by changes to
>>> System time on Linux
>>> (https://bugs.openjdk.java.net/browse/JDK-6311057)
>>> - JDK-8037340: Linux semaphores to use CLOCK_REALTIME
>>> (https://bugs.openjdk.java.net/browse/JDK-8037340)
>>>
>>> JDK-6311057 switches to use CLOCK_MONOTONIC instead of gettimeofday in
>>> compute_abstime
>>>
>>> (http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/rev/fd506dce728a?revcount=240)
>>>
>>> JDK-8037340 was a change made in response to JDK-6311057 to use
>>> CLOCK_REALTIME in Semaphore::timedWait
>>>
>>> (http://hg.openjdk.java.net/jdk7u/jdk7u/hotspot/rev/3031c29f30b5?revcount=240)
>>>
>>> My theory was that JDK-6311057 causes the jvm to behave correctly
>>> during the leap second bug because it uses monotonic time instead of
>>> gettimeofday/CLOCK_REALTIME, which will be susceptible to the leap
>>> second kernel bug, and code that does things like sleep/park in a loop
>>> for less than a second until a condition is met basically enters a
>>> busy loop and consumes cpu.
>>>
>>> I backed out JDK-6311057 at jdk7u60 and built openjdk and reran my
>>> test low and behold, as soon as the leap second is inserted my
>>> cassandra jvm consumes 100% cpu.
>>>
>>> So what I'm hoping to get out of this is:
>>>
>>> 1. Does my line of reasoning seem correct?  Is it JDK-6311057 that is
>>> likely protecting my JVM from entering a busy loop and needlessly
>>> consuming cpu?
>>> 2. As JDK-8037340 maintains that Semaphore::timedWait uses
>>> CLOCK_REALTIME, could I anticipate any possible issues involving leap
>>> second?   I was trying to think of a scenario that would exercise this
>>> logic during a leap second, but I am not familiar enough with hotspot
>>> and how java code would map to this to try to exercise it during a
>>> leap second.  I tried a program that in effect does
>>> while(!sem.tryAcquire(1, TimeUnit.microseconds)), but it behaves
>>> normally after a leap second with the kernel issue.
>>>
>>> Thank you for your time!
>>>
>>> Regards,
>>> Andy
>>>
>>