RFR: 8220795: Introduce TimedYield utility to improve time-to-safepoint on Windows

David Holmes david.holmes at oracle.com
Fri Apr 12 09:15:08 UTC 2019


Hi Robin,

Sorry for the delay, I've been mulling over this one ... :)

On 11/04/2019 12:02 am, Robin Westberg wrote:
> Hi David,
> 
> Thanks for the detailed review!
> 
>> On 9 Apr 2019, at 04:45, David Holmes <david.holmes at oracle.com> wrote:
>>
>> Hi Robin,
>>
>> On 8/04/2019 10:47 pm, Robin Westberg wrote:
>>> Hi again,
>>> Here’s an updated version where I’ve moved the naked_short_nanosleep function into the Posix class, to avoid future cross-platform use. (It’s still used in the SpinYield and TimedYield implementations though).
>>
>> But you also changed the existing Windows os::naked_short_sleep to use the WaitableTimer which is a significant change to make. Is this just because it will likely have better resolution than the native Sleep function?
>>
>> Code using os::naked_short_sleep(1) might be unexpectedly impacted by this if what was a 10ms (or worse) sleep becomes closer to 1ms. PerfDataManager::destroy in particular could be impacted as it's racy to begin with. That's not to say this isn't a good thing to fix, just be aware it may have unexpected consequences.
> 
> You are right, my thinking was that it would be nice to retain the more exact version found in the nanosleep implementation that I was removing. But I’d be fine with doing that as a separate change and run additional testing on it. I can file a separate RFE to keep track of it.
> 
>>> Full webrev: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.01/
>>> Incremental: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.00-01/
>>
>> src/hotspot/share/utilities/spinYield.cpp
>>
>> I'm somewhat dubious about getting rid of non-Posix naked_short_nanosleep and instead adding a win32 ifdef to this code. It kind of defeats the purpose of the os abstraction layer - albeit that Windows can't really do a nanosleep.
>>
>> Why did you get rid of the sleep_ns parameter and hardwire it to 1000? A configurable sleep time would be a feature of this utility.
> 
> The original implementation of SpinYield had the sleep hardwired to a 1 ms os::naked_short_sleep - when os::naked_short_nanosleep was introduced this parameter was added as well, with a default value of 1000. But there’s no code that actually sets the parameter, and it’s a bit misleading that the parameter accepts nanoseconds when that cannot be acted upon on Windows. So I figured that it would be better to remove the parameter but retain the existing behavior. But this should probably be revisited when the fate of TimedYield has been decided.

I evaluate the abstraction and API that is being provided, and even if 
the initial user doesn't want anything but the default value, the 
ability to set the value makes perfect sense for that particular 
abstraction. The fact it can't be acted upon on Windows is unfortunate, 
but these things should always be specified as "best effort" when we 
can't have guarantees.
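To illustrate the best-effort point, here is a minimal, self-contained sketch of what such a configurable sleep parameter could look like. The class name BestEffortYield is hypothetical, and std::this_thread primitives stand in for os::naked_yield() and os::naked_short_nanosleep(); this is not the code from the webrev.

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical sketch (not from the webrev) of a SpinYield-like utility
// whose sleep period is configurable but documented as best effort: the
// OS may well sleep longer than requested, e.g. due to Windows timer
// granularity. std::this_thread::yield() stands in for os::naked_yield()
// and std::this_thread::sleep_for for os::naked_short_nanosleep().
class BestEffortYield {
  int64_t _sleep_ns;  // requested sleep period; a hint, not a guarantee
  int _yields;        // yields attempted before falling back to sleeping
public:
  explicit BestEffortYield(int64_t sleep_ns = 1000)
    : _sleep_ns(sleep_ns), _yields(0) {}

  int64_t sleep_ns() const { return _sleep_ns; }

  void wait() {
    if (_yields < 4) {
      ++_yields;
      std::this_thread::yield();
    } else {
      // Best effort: the actual delay may greatly exceed _sleep_ns.
      std::this_thread::sleep_for(std::chrono::nanoseconds(_sleep_ns));
    }
  }
};
```

The caller can tune the sleep period, but the contract only promises "at least roughly this long", which sidesteps the Windows granularity problem at the API level.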

>> ---
>>
>> src/hotspot/share/utilities/timedYield.hpp
>>
>> 36 // scheduled on the same cpu as the waiter, we will first try local cpu
>> 37 // yielding until we reach OS sleep primitive granularity in the waiting
>>
>> Whether or not the yielding is "cpu local" will depend on the OS and the scheduler in use. I would just refer to OS native yield.
> 
> The intention is to coerce the scheduler to perform a cpu-local yield if possible - more on that later.
> 
>>   40   jlong _yield_time_ns;
>>   41   jlong _sleep_time_ns;
>>   42   jlong _max_yield_time_ns;
>>   43   jlong _yield_granularity_ns;
>>
>> We are avoiding using j-types in the VM unless they actually pertain to Java variables. These should be int64_t (to be compatible with javaTimeNanos call).
> 
> Thanks, will change.
> 
>>   52   // Perform next round of delay.
>>   53   void wait();
>>
>> I think delay() would be a better name than wait().
> 
> This was inspired by the SpinYield utility, but I certainly wouldn’t mind renaming it. Perhaps SpinYield::wait should be renamed as well to keep the symmetry?

If it's already in use then wait() is okay.

>> ---
>>
>> src/hotspot/share/utilities/timedYield.cpp
>>
>>   42   // Phase 1 - local cpu yielding
>>   43   if (_yield_time_ns < _max_yield_time_ns) {
>>   44 #ifdef WIN32
>>   45     if (SwitchToThread() == 0) {
>>   46       // Nothing else is ready to run on this cpu, spin a little
>>   47       while (os::javaTimeNanos() - start < _yield_granularity_ns) {
>>   48         SpinPause();
>>   49       }
>>   50     }
>>   51 #else
>>   52     os::Posix::naked_short_nanosleep(_yield_granularity_ns);
>>   53 #endif
>>   54     _yield_time_ns += os::javaTimeNanos() - start;
>>   55     return;
>>   56   }
>>
>> I have a few issues with this code. It's breaking the os abstraction layer again by using an OS ifdef and not using os APIs that exist for this very purpose - i.e. os::naked_yield(). And for non-Windows it seems quite bizarre that the "yielding" part of TimedYield is actually implemented with a sleep and not os::naked_yield!
> 
> The “root” problem here is that the existing os primitives unfortunately do not quite map to the goal of this utility. Let me try to break it down a bit and see if it makes sense: The purpose of TimedYield is to wait for a thread rendezvous, for as short a time as possible. (In this case, waiting for threads to notice that the safepoint poll has been armed). Ideally, we are aiming for sub-millisecond waiting times here.
> 
> If these threads are scheduled on other CPUs this is pretty simple - we could spin or nanosleep and they would make progress. However - if one or more of these threads are scheduled to run on the current CPU things become interesting. Waiting for the OS scheduler to move threads to different CPUs can take milliseconds - much slower than what is possible to achieve. So, we want to try performing a cpu-local yield at first.

This is heading into territory that we only go into if absolutely 
necessary. We don't want to be coding to our assumed knowledge of what a 
particular scheduler will do - especially when we don't even know we 
will be executing on that scheduler. Unless there is very good reason we 
should stick with the functionality and semantics provided by the OS.

> On Windows, this maps reasonably well to os::naked_yield:
> 
> void os::naked_yield() {
>    // Consider passing back the return value from SwitchToThread().
>    SwitchToThread();
> }
> 
> But for example on Linux, there is instead this:
> 
> // Linux CFS scheduler (since 2.6.23) does not guarantee sched_yield(2) will
> // actually give up the CPU. Since skip buddy (v2.6.28):
> //
> // * Sets the yielding task as skip buddy for current CPU's run queue.
> // * Picks next from run queue, if empty, picks a skip buddy (can be the yielding task).
> // * Clears skip buddies for this run queue (yielding task no longer a skip buddy).
> //
> // An alternative is calling os::naked_short_nanosleep with a small number to avoid
> // getting re-scheduled immediately.
> //
> void os::naked_yield() {
>    sched_yield();
> }
> 
> In both cases, we may get rescheduled immediately - on Windows this is indicated in the return value from SwitchToThread, but on Linux we don’t know. On Windows, it is then fine to spin a little while as there is nothing else ready to run. But on Linux, the CFS scheduler penalizes spinning as the runtime counter is increased, which will hurt the waiter when the time comes to perform actual work. So we don’t want to spin on a no-op sched_yield; we have to use nanosleep instead. But then we are back to the original problem - the current nanosleep is not what we want to do on Windows in this situation. So why not change nanosleep on Windows:

Yielding should always be a hint, not a requirement. Trying to second 
guess who may be running on which core and what the load may be is not a 
game we want to play lightly. There are just too many factors out of our 
control that can change on a different piece of hardware or a different 
OS release etc.

>> If the existing os api's need adjustment or expansion to provide the functionality desired by this code then I would much prefer to see the os API's updated to address that.
>>
>> That said, given the original problem is that os::naked_short_nanosleep on Windows is too coarse with the use of WaitableTimer why not just replace that with a simple loop of the form:
>>
>> while (elapsed < sleep_ns) {
>>   if (SwitchToThread() == 0) {
>>     SpinPause();
>>   }
>>   elapsed = …
>> }
> 
> So this would actually work fine in this case - but it's probably not what you would expect from a sleep function in the general case. On Linux, you would get control back after the provided nanosecond period even if another thread executed in the meantime. But on Windows, you are potentially giving up your entire timeslice if another thread is ready to run - this would be much worse than plain naked_short_sleep as you may not get control back for another 15 ms or so.

Perhaps you have different expectations on what a sleep may do, but I 
don't expect absolute precision here. I expect a sleep function to take 
me off CPU for (ideally**) at least a given amount of time, but have no 
expectations about the maximum - that depends on a lot of factors about 
os timers and scheduling that I don't want to have to think about or 
know about. I don't even assume I go off CPU for all that time as I know 
there are limits around timer resolution etc and so the OS may in fact 
do some spinning instead. That's all fine by me. If the sleeping thread 
doesn't get back on CPU for 15-20ms then so be it, some other thread is 
getting to run and do what it needs to do.

** Windows returns early from timed-waits in many cases.
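The "at least the requested time, no promised maximum" contract described above can be probed with a trivial standalone sketch (using std::this_thread::sleep_for as a generic stand-in for any OS sleep primitive; the function name is illustrative):

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

// Probe the "at least the requested time, no promised maximum" contract
// of a sleep primitive: returns the actually-elapsed nanoseconds for a
// requested sleep of `ns`. The result is >= ns, but can be far larger on
// a loaded machine or with coarse OS timer granularity (e.g. the ~15 ms
// default tick on Windows mentioned above).
inline int64_t measured_sleep_ns(int64_t ns) {
  using namespace std::chrono;
  const auto t0 = steady_clock::now();
  std::this_thread::sleep_for(nanoseconds(ns));
  return duration_cast<nanoseconds>(steady_clock::now() - t0).count();
}
```

Only the lower bound can be asserted; any upper bound depends on the scheduler and timer resolution and is deliberately unspecified.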

> That all being said, switching the Windows naked_short_nanosleep to the above implementation would be just fine - but I really think it should be renamed in that case. Perhaps something like os::timed_yield(jlong ns) would make sense? The additional backoff mechanism in TimedYield can be reverted back to being handled by the safepointing code. The reason I made TimedYield into a separate utility was that it may be useful in other places as well, but such future use can of course be handled separately if the need actually arises.

I'd prefer to fix a Windows problem, just on Windows. I'm not hung up on 
having sleep in the name, but if you prefer timed_yield to 
naked_short_nanosleep then that's fine (and avoids people wondering what 
the "naked" part means).
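A portable, illustrative sketch of the agreed direction - a yield/spin loop bounded by elapsed time - might look as follows. std::this_thread::yield() stands in for SwitchToThread() and a relaxed atomic load for SpinPause(); the name timed_yield_sketch is hypothetical, not the actual HotSpot code.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Portable, illustrative stand-in for the proposed os::timed_yield(ns) on
// Windows: repeatedly yield (SwitchToThread() in the real thing, here
// std::this_thread::yield()) and spin-pause (SpinPause() in the real
// thing, here a relaxed atomic load) until at least ns nanoseconds have
// elapsed. A sketch of the idea only, not the actual patch.
inline void timed_yield_sketch(int64_t ns) {
  using clock = std::chrono::steady_clock;
  const auto deadline = clock::now() + std::chrono::nanoseconds(ns);
  std::atomic<int> pause{0};
  while (clock::now() < deadline) {
    std::this_thread::yield();               // may return immediately
    pause.load(std::memory_order_relaxed);   // cheap spin-pause stand-in
  }
}
```

The real Windows version would additionally consult SwitchToThread()'s return value to decide whether anything else was runnable, but the time-bounded structure is the same.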

If we need the TimedYield capability in the future then lets revisit 
that then.

Thanks,
David
-----

> Best regards,
> Robin
> 
>>
>> ?
>>
>> Thanks,
>> David
>> -----
>>
>>
>>> Best regards,
>>> Robin
>>>> On 5 Apr 2019, at 13:54, Robin Westberg <robin.westberg at oracle.com> wrote:
>>>>
>>>> Hi David,
>>>>
>>>>> On 5 Apr 2019, at 12:10, David Holmes <david.holmes at oracle.com> wrote:
>>>>>
>>>>> On 5/04/2019 7:53 pm, Robin Westberg wrote:
>>>>>> Hi David,
>>>>>> Thanks for taking a look!
>>>>>>> On 5 Apr 2019, at 10:49, David Holmes <david.holmes at oracle.com> wrote:
>>>>>>>
>>>>>>> Hi Robin,
>>>>>>>
>>>>>>> On 5/04/2019 6:05 pm, Robin Westberg wrote:
>>>>>>>> Hi all,
>>>>>>>> Please review the following change that modifies the way safepointing waits for other threads to stop. As part of JDK-8203469, os::naked_short_nanosleep was used for all platforms. However, on Windows, this function has millisecond granularity which is too coarse and caused performance regressions. Instead, we can take advantage of the fact that Windows can tell us if anyone else is waiting to run on the current cpu. For other platforms the original waiting is used.
>>>>>>>
>>>>>>> Can't you just make the new code the implementation of os::naked_short_nanosleep on Windows and avoid adding yet another sleep/yield abstraction? If Windows nanosleep only has millisecond granularity then it's broken.
>>>>>> Right, I considered that approach, but it's not obvious to me what semantics would be appropriate for that. Depending on whether you want an accurate sleep or want other threads to make progress, it could conceivably be either implemented as pure spinning, or spinning combined with yielding (either cpu-local or Sleep(0)-style).
>>>>>> That said, perhaps os::naked_short_nanosleep should be removed; the only remaining usage is in spinYield, which turns to sleeping after giving up on yielding. For Windows, it may be more appropriate to switch to Sleep(0) in that case.
>>>>>
>>>>> I definitely don't think we need to introduce another abstraction (though I'm okay with replacing one with a new one).
>>>>
>>>> Okay, then I’d prefer to drop the os::naked_short_nanosleep. As I see it, you would use SpinYield when waiting for something like a CAS race, and TimedYield when waiting for a thread rendezvous.
>>>>
>>>> I did try to make TimedYield into a different “policy” for the existing SpinYield class at first, but didn’t quite feel like I found a nice API for it. So perhaps it's fine to just have them as separate classes.
>>>>
>>>>> It was probably discussed at the time but the naked_short_nanosleep on Windows definitely seems to have far too much overhead - even having to create a new waitable-timer each time. Do we know if the overhead is actually the resolution of the timer or whether it's all the extra work that method needs to do? I should try to dig up the email on that. :)
>>>>
>>>> Yes, as far as I know the best possible sleep resolution you can obtain on Windows is one millisecond (perhaps 0.5ms in theory if you go directly for NtSetTimerResolution). The SetWaitableTimer approach is the best though to actually obtain 1ms resolution, plain Sleep(1) usually sleeps around 2ms.
>>>>
>>>> Best regards,
>>>> Robin
>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>> Best regards,
>>>>>> Robin
>>>>>>>
>>>>>>> Thanks,
>>>>>>> David
>>>>>>> -----
>>>>>>>
>>>>>>>> Various benchmarks (specjbb2015, specjvm2008) show better numbers for Windows and no regressions for Linux.
>>>>>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8220795
>>>>>>>> Webrev: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.00/
>>>>>>>> Testing: tier1
>>>>>>>> Best regards,
>>>>>>>> Robin
> 


More information about the hotspot-runtime-dev mailing list