RFR: 8220795: Introduce TimedYield utility to improve time-to-safepoint on Windows

Robin Westberg robin.westberg at oracle.com
Wed Apr 10 14:02:51 UTC 2019


Hi David,

Thanks for the detailed review!

> On 9 Apr 2019, at 04:45, David Holmes <david.holmes at oracle.com> wrote:
> 
> Hi Robin,
> 
> On 8/04/2019 10:47 pm, Robin Westberg wrote:
>> Hi again,
>> Here’s an updated version where I’ve moved the naked_short_nanosleep function into the Posix class, to avoid future cross-platform use. (It’s still used in the SpinYield and TimedYield implementations though).
> 
> But you also changed the existing Windows os::naked_short_sleep to use the WaitableTimer which is a significant change to make. Is this just because it will likely have better resolution than the native Sleep function?
> 
> Code using os::naked_short_sleep(1) might be unexpectedly impacted by this if what was a 10ms (or worse) sleep becomes closer to 1ms. PerfDataManager::destroy in particular could be impacted as it's racy to begin with. That's not to say this isn't a good thing to fix, just be aware it may have unexpected consequences.

You are right, my thinking was that it would be nice to retain the more exact version found in the nanosleep implementation that I was removing. But I’d be fine with doing that as a separate change and running additional testing on it. I can file a separate RFE to keep track of it.

>> Full webrev: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.01/
>> Incremental: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.00-01/
> 
> src/hotspot/share/utilities/spinYield.cpp
> 
> I'm somewhat dubious about getting rid of non-Posix naked_short_nanosleep and instead adding a win32 ifdef to this code. It kind of defeats the purpose of the os abstraction layer - albeit that Windows can't really do a nanosleep.
> 
> Why did you get rid of the sleep_ns parameter and hardwire it to 1000? A configurable sleep time would be a feature of this utility.

The original implementation of SpinYield had the sleep hardwired to a 1 ms os::naked_short_sleep - when os::naked_short_nanosleep was introduced, this parameter was added as well, with a default value of 1000. But no code actually sets the parameter, and it’s a bit misleading that the parameter accepts nanoseconds when that cannot be acted upon on Windows. So I figured it would be better to remove the parameter but retain the existing behavior. This should probably be revisited when the fate of TimedYield has been decided, though.

> ---
> 
> src/hotspot/share/utilities/timedYield.hpp
> 
> 36 // scheduled on the same cpu as the waiter, we will first try local cpu
> 37 // yielding until we reach OS sleep primitive granularity in the waiting
> 
> Whether or not the yielding is "cpu local" will depend on the OS and the scheduler in use. I would just refer to OS native yield.

The intention is to coerce the scheduler to perform a cpu-local yield if possible - more on that later.

>  40   jlong _yield_time_ns;
>  41   jlong _sleep_time_ns;
>  42   jlong _max_yield_time_ns;
>  43   jlong _yield_granularity_ns;
> 
> We are avoiding using j-types in the VM unless they actually pertain to Java variables. These should be int64_t (to be compatible with javaTimeNanos call).

Thanks, will change.

>  52   // Perform next round of delay.
>  53   void wait();
> 
> I think delay() would be a better name than wait().

This was inspired by the SpinYield utility, but I certainly wouldn’t mind renaming it. Perhaps SpinYield::wait should be renamed as well to keep the symmetry?

> ---
> 
> src/hotspot/share/utilities/timedYield.cpp
> 
>  42   // Phase 1 - local cpu yielding
>  43   if (_yield_time_ns < _max_yield_time_ns) {
>  44 #ifdef WIN32
>  45     if (SwitchToThread() == 0) {
>  46       // Nothing else is ready to run on this cpu, spin a little
>  47       while (os::javaTimeNanos() - start < _yield_granularity_ns) {
>  48         SpinPause();
>  49       }
>  50     }
>  51 #else
>  52     os::Posix::naked_short_nanosleep(_yield_granularity_ns);
>  53 #endif
>  54     _yield_time_ns += os::javaTimeNanos() - start;
>  55     return;
>  56   }
> 
> I have a few issues with this code. It's breaking the os abstraction layer again by using an OS ifdef and not using os APIs that exist for this very purpose - i.e. os::naked_yield(). And for non-Windows it seems quite bizarre that the "yielding" part of TimedYield is actually implemented with a sleep and not os::naked_yield!

The “root” problem here is that the existing os primitives unfortunately do not quite map to the goal of this utility. Let me try to break it down a bit and see if it makes sense: the purpose of TimedYield is to wait for a thread rendezvous, for as short a time as possible. (In this case, waiting for threads to notice that the safepoint poll has been armed.) Ideally, we are aiming for sub-millisecond waiting times here.

If these threads are scheduled on other cpus this is pretty simple - we could spin or nanosleep and they would make progress. However, if one or more of these threads are scheduled to run on the current cpu, things become interesting. Waiting for the OS scheduler to move threads to different cpus can take milliseconds - much slower than what is achievable. So, we want to try performing a cpu-local yield at first.
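As an illustration only, here is a portable sketch of that two-phase idea, using std::this_thread and std::chrono as stand-ins for the os:: primitives. The class name, thresholds, and default values here are hypothetical, not the actual TimedYield code:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Illustrative sketch of a two-phase rendezvous wait: cheap yields first,
// falling back to real OS sleeps once the yield budget is exhausted.
class TimedYieldSketch {
 public:
  explicit TimedYieldSketch(int64_t max_yield_ns = 100000,   // hypothetical 100 us yield budget
                            int64_t granularity_ns = 1000)   // hypothetical ~1 us sleep step
      : _yield_time_ns(0),
        _max_yield_time_ns(max_yield_ns),
        _yield_granularity_ns(granularity_ns) {}

  // Perform one round of waiting.
  void wait() {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    if (_yield_time_ns < _max_yield_time_ns) {
      // Phase 1: give the scheduler a chance to run another ready thread
      // without surrendering a whole timeslice.
      std::this_thread::yield();
      _yield_time_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(
          clock::now() - start).count();
      return;
    }
    // Phase 2: yield budget spent, fall back to an OS sleep.
    std::this_thread::sleep_for(std::chrono::nanoseconds(_yield_granularity_ns));
  }

  bool in_sleep_phase() const { return _yield_time_ns >= _max_yield_time_ns; }

 private:
  int64_t _yield_time_ns;
  int64_t _max_yield_time_ns;
  int64_t _yield_granularity_ns;
};
```

The real code additionally has to cope with the per-OS yield semantics discussed below, which is where the portable stand-ins stop being adequate.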

On Windows, this maps reasonably well to os::naked_yield:

void os::naked_yield() {
  // Consider passing back the return value from SwitchToThread().
  SwitchToThread();
}

But for example on Linux, there is instead this:

// Linux CFS scheduler (since 2.6.23) does not guarantee sched_yield(2) will
// actually give up the CPU. Since skip buddy (v2.6.28):
//
// * Sets the yielding task as skip buddy for current CPU's run queue.
// * Picks next from run queue, if empty, picks a skip buddy (can be the yielding task).
// * Clears skip buddies for this run queue (yielding task no longer a skip buddy).
//
// An alternative is calling os::naked_short_nanosleep with a small number to avoid
// getting re-scheduled immediately.
//
void os::naked_yield() {
  sched_yield();
}

In both cases, we may get rescheduled immediately - on Windows this is indicated in the return value from SwitchToThread, but on Linux we don’t know. On Windows, it is then fine to spin a little while, as there is nothing else ready to run. But on Linux, the CFS scheduler penalizes spinning, as the runtime counter is increased, which will hurt the waiter when the time comes to perform actual work. So we don’t want to spin on a no-op sched_yield; we have to use nanosleep instead. But then we are back to the original problem - the current nanosleep is not what we want to do on Windows in this situation. So why not change nanosleep on Windows:

> If the existing os api's need adjustment or expansion to provide the functionality desired by this code then I would much prefer to see the os API's updated to address that.
> 
> That said, given the original problem is that os::naked_short_nanosleep on Windows is too coarse with the use of WaitableTimer why not just replace that with a simple loop of the form:
> 
> while (elapsed < sleep_ns) {
>   if (SwitchToThread() == 0) {
>     SpinPause();
>   }
>   elapsed = …
> }

So this would actually work fine in this case - but it's probably not what you would expect from a sleep function in the general case. On Linux, you would get control back after the provided nanosecond period even if another thread executed in the meantime. But on Windows, you are potentially giving up your entire timeslice if another thread is ready to run - this would be much worse than plain naked_short_sleep as you may not get control back for another 15 ms or so.

That all being said, switching the Windows naked_short_nanosleep to the above implementation would be just fine - but I really think it should be renamed in that case. Perhaps something like os::timed_yield(jlong ns) would make sense? The additional backoff mechanism in TimedYield can then be reverted back to being handled by the safepointing code. The reason I made TimedYield into a separate utility was that it may be useful in other places as well, but such future use can of course be handled separately if the need actually arises.
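For what it’s worth, here is a portable sketch of the loop shape such an os::timed_yield could have. std::this_thread::yield() stands in for SwitchToThread()/SpinPause() and steady_clock for javaTimeNanos, so treat this as an illustration under those assumptions rather than the actual HotSpot code:

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Sketch of a timed yield: keep yielding until at least `ns` nanoseconds
// have elapsed. On Windows the yield would be SwitchToThread(), with
// SpinPause() when nothing else is runnable; here a plain yield stands in
// for both. Returns the time actually waited, for the caller's benefit.
inline int64_t timed_yield(int64_t ns) {
  using clock = std::chrono::steady_clock;
  const auto start = clock::now();
  int64_t elapsed = 0;
  while (elapsed < ns) {
    std::this_thread::yield();
    elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(
        clock::now() - start).count();
  }
  return elapsed;
}
```

Unlike a sleep, this never blocks for longer than the scheduler forces it to, which is exactly the property the safepoint wait wants.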

Best regards,
Robin

> 
> ?
> 
> Thanks,
> David
> -----
> 
> 
>> Best regards,
>> Robin
>>> On 5 Apr 2019, at 13:54, Robin Westberg <robin.westberg at oracle.com> wrote:
>>> 
>>> Hi David,
>>> 
>>>> On 5 Apr 2019, at 12:10, David Holmes <david.holmes at oracle.com> wrote:
>>>> 
>>>> On 5/04/2019 7:53 pm, Robin Westberg wrote:
>>>>> Hi David,
>>>>> Thanks for taking a look!
>>>>>> On 5 Apr 2019, at 10:49, David Holmes <david.holmes at oracle.com> wrote:
>>>>>> 
>>>>>> Hi Robin,
>>>>>> 
>>>>>> On 5/04/2019 6:05 pm, Robin Westberg wrote:
>>>>>>> Hi all,
>>>>>>> Please review the following change that modifies the way safepointing waits for other threads to stop. As part of JDK-8203469, os::naked_short_nanosleep was used for all platforms. However, on Windows, this function has millisecond granularity which is too coarse and caused performance regressions. Instead, we can take advantage of the fact that Windows can tell us if anyone else is waiting to run on the current cpu. For other platforms the original waiting is used.
>>>>>> 
>>>>>> Can't you just make the new code the implementation of os::naked_short_nanosleep on Windows and avoid adding yet another sleep/yield abstraction? If Windows nanosleep only has millisecond granularity then it's broken.
>>>>> Right, I considered that approach, but it's not obvious to me what semantics would be appropriate for that. Depending on whether you want an accurate sleep or want other threads to make progress, it could conceivably be either implemented as pure spinning, or spinning combined with yielding (either cpu-local or Sleep(0)-style).
>>>>> That said, perhaps os::naked_short_nanosleep should be removed; the only remaining usage is in SpinYield, which turns to sleeping after giving up on yielding. For Windows, it may be more appropriate to switch to Sleep(0) in that case.
>>>> 
>>>> I definitely don't think we need to introduce another abstraction (though I'm okay with replacing one with a new one).
>>> 
>>> Okay, then I’d prefer to drop the os::naked_short_nanosleep. As I see it, you would use SpinYield when waiting for something like a CAS race, and TimedYield when waiting for a thread rendezvous.
>>> 
>>> I did try to make TimedYield into a different “policy” for the existing SpinYield class at first, but didn’t quite feel like I found a nice API for it. So perhaps it's fine to just have them as separate classes.
>>> 
>>>> It was probably discussed at the time but the naked_short_nanosleep on Windows definitely seems to have far too much overhead - even having to create a new waitable-timer each time. Do we know if the overhead is actually the resolution of the timer or whether it's all the extra work that method needs to do? I should try to dig up the email on that. :)
>>> 
>>> Yes, as far as I know the best possible sleep resolution you can obtain on Windows is one millisecond (perhaps 0.5 ms in theory if you go directly for NtSetTimerResolution). The SetWaitableTimer approach is the best way to actually obtain 1 ms resolution, though; plain Sleep(1) usually sleeps around 2 ms.
>>> 
>>> Best regards,
>>> Robin
>>> 
>>>> 
>>>> Thanks,
>>>> David
>>>> 
>>>>> Best regards,
>>>>> Robin
>>>>>> 
>>>>>> Thanks,
>>>>>> David
>>>>>> -----
>>>>>> 
>>>>>>> Various benchmarks (specjbb2015, specjvm2008) show better numbers for Windows and no regressions for Linux.
>>>>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8220795
>>>>>>> Webrev: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.00/
>>>>>>> Testing: tier1
>>>>>>> Best regards,
>>>>>>> Robin



More information about the hotspot-runtime-dev mailing list