RFR: 8220795: Introduce TimedYield utility to improve time-to-safepoint on Windows
Robin Westberg
robin.westberg at oracle.com
Wed Apr 17 13:55:49 UTC 2019
Hi David,
> On 12 Apr 2019, at 11:15, David Holmes <david.holmes at oracle.com> wrote:
>
> Hi Robin,
>
> Sorry for the delay, I've been mulling over this one ... :)
No worries, I’ve been otherwise occupied anyway.. :)
>> The original implementation of SpinYield had the sleep hardwired to a 1 ms os::naked_short_sleep - when os::naked_short_nanosleep was introduced, this sleep-time parameter was added as well, with a default value of 1000. But there’s no code that actually sets the parameter, and it’s a bit misleading that the parameter accepts nanoseconds when that cannot be acted upon on Windows. So I figured that it would be better to remove the parameter but retain the existing behavior. But this should probably be revisited when the fate of TimedYield has been decided..
>
> I evaluate the abstraction and API that is being provided, and even if the initial user doesn't want anything but the default value, the ability to set the value makes perfect sense for that particular abstraction. The fact it can't be acted upon on Windows is unfortunate, but these things should always be specified as "best effort" when we can't have guarantees.
Fair enough, I’ll avoid touching that part.
>>> I think delay() would be a better name than wait().
>> This was inspired by the SpinYield utility, but I certainly wouldn’t mind renaming it. Perhaps SpinYield::wait should be renamed as well to keep the symmetry?
>
> If it's already in use then wait() is okay.
Since I dropped the new TimedYield class I’ll leave this one as well.
>> The “root” problem here is that the existing os primitives unfortunately do not quite map to the goal of this utility. Let me try to break it down a bit and see if it makes sense: the purpose of TimedYield is to wait for a thread rendezvous for as short a time as possible (in this case, waiting for threads to notice that the safepoint poll has been armed). Ideally, we are aiming for sub-millisecond waiting times here.
>> If these threads are scheduled on other CPUs this is pretty simple - we could spin or nanosleep and they would make progress. However, if one or more of these threads are scheduled to run on the current CPU, things become interesting. Waiting for the OS scheduler to move threads to different CPUs can take milliseconds - much slower than what is otherwise possible. So, we want to try performing a cpu-local yield at first.
>
> This is heading into territory that we only go into if absolutely necessary. We don't want to be coding to our assumed knowledge of what a particular scheduler will do - especially when we don't even know we will be executing on that scheduler. Unless there is very good reason we should stick with the functionality and semantics provided by the OS.
Right, that’s why I thought it could be useful to encapsulate the “optimal” strategy for thread rendezvous waiting in something that could be implemented differently depending on the OS / scheduler, as it may not be obvious that nanosleep is the primitive of choice for that. Perhaps TimedYield was not the best name for that either though..
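For reference, the strategy TimedYield encapsulated looks roughly like this (a simplified sketch of the dropped patch - the yield threshold and sleep interval here are illustrative values, not the actual tuning):

#include "memory/allocation.hpp"
#include "runtime/os.hpp"

// Rendezvous-wait utility: start with cpu-local yields so that a thread
// scheduled on this cpu can run and reach the rendezvous point, then fall
// back to short sleeps so we stop consuming cpu time (which matters under
// CFS, where spinning is penalized).
class TimedYield : public StackObj {
  uint _yields;
 public:
  TimedYield() : _yields(0) {}
  void wait() {
    if (_yields < 100) {                    // illustrative threshold
      os::naked_yield();                    // cpu-local yield first
      _yields++;
    } else {
      os::naked_short_nanosleep(100000);    // illustrative 100 us backoff
    }
  }
};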
Just for the record, here are the differences I observed using different strategies (specjvm2008 time-to-safepoint in microseconds):
             Average   Median   90th percentile
Original:        693      677              1088
SpinYield:       438      222              1014
TimedYield:      226      180               241
>> On Windows, this maps reasonably well to os::naked_yield:
>> void os::naked_yield() {
>> // Consider passing back the return value from SwitchToThread().
>> SwitchToThread();
>> }
>> But for example on Linux, there is instead this:
>> // Linux CFS scheduler (since 2.6.23) does not guarantee sched_yield(2) will
>> // actually give up the CPU. Since skip buddy (v2.6.28):
>> //
>> // * Sets the yielding task as skip buddy for current CPU's run queue.
>> // * Picks next from run queue, if empty, picks a skip buddy (can be the yielding task).
>> // * Clears skip buddies for this run queue (yielding task no longer a skip buddy).
>> //
>> // An alternative is calling os::naked_short_nanosleep with a small number to avoid
>> // getting re-scheduled immediately.
>> //
>> void os::naked_yield() {
>> sched_yield();
>> }
>> In both cases, we may get rescheduled immediately - on Windows this is indicated in the return value from SwitchToThread, but on Linux we don’t know. On Windows, it is then fine to spin a little while, as there is nothing else ready to run. But on Linux, the CFS scheduler penalizes spinning as the runtime counter is increased, which will hurt the waiter when the time comes to perform actual work. So we don’t want to spin on a no-op sched_yield; we have to use nanosleep instead. But then we are back to the original problem - the current nanosleep is not what we want to do on Windows in this situation. So why not change nanosleep on Windows?
>
> Yielding should always be a hint, not a requirement. Trying to second-guess who may be running on which core and what the load may be is not a game we want to play lightly. There are just too many factors out of our control that can change on a different piece of hardware or a different OS release etc.
>
>>> If the existing os api's need adjustment or expansion to provide the functionality desired by this code then I would much prefer to see the os API's updated to address that.
>>>
>>> That said, given the original problem is that os::naked_short_nanosleep on Windows is too coarse with the use of WaitableTimer why not just replace that with a simple loop of the form:
>>>
>>> while (elapsed < sleep_ns) {
>>>   if (SwitchToThread() == 0) {
>>>     SpinPause();
>>>     elapsed = …
>>>   }
>>> }
>> So this would actually work fine in this case - but it's probably not what you would expect from a sleep function in the general case. On Linux, you would get control back after the provided nanosecond period even if another thread executed in the meantime. But on Windows, you are potentially giving up your entire timeslice if another thread is ready to run - this would be much worse than plain naked_short_sleep as you may not get control back for another 15 ms or so.
>
> Perhaps you have different expectations on what a sleep may do, but I don't expect absolute precision here. I expect a sleep function to take me off CPU for (ideally**) at least a given amount of time, but have no expectations about the maximum - that depends on a lot of factors about os timers and scheduling that I don't want to have to think about or know about. I don't even assume I go off cpu for all that time as I know there are limits around timer resolution etc and so the OS may in fact do some spinning instead. That's all fine by me. If the sleeping thread doesn't get back on CPU for 15-20ms then so be it, some other thread is getting to run and do what it needs to do.
I guess my main concern is that I would expect better precision from a sleep method that accepts nanoseconds rather than milliseconds, but I can certainly live with not changing that.
> ** Windows returns early from timed-waits in many cases.
>
>> That all being said, switching the Windows naked_short_nanosleep to the above implementation would be just fine - but I really think it should be renamed in that case. Perhaps something like os::timed_yield(jlong ns) would make sense? The additional backoff mechanism in TimedYield can be reverted back to being handled by the safepointing code. The reason I made TimedYield into a separate utility was that it may be useful in other places as well, but such future use can of course be handled separately if the need actually arises.
>
> I'd prefer to fix a windows problem, just on windows. I'm not hung up on having sleep in the name, but if you prefer timed_yield to naked_short_nanosleep then that's fine (and avoids people wondering what the "naked" part means).
>
> If we need the TimedYield capability in the future then let's revisit it then.
Sure, here’s a lighter version of the change that just modifies the Windows implementation of naked_short_nanosleep, with a few adjustments to some assumptions in the waiting-for-safepoint backoff strategy.
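In outline, the new Windows implementation follows the loop you sketched, something like this (a simplified sketch assuming os::javaTimeNanos() as the monotonic time source - see the webrev below for the actual code):

void os::naked_short_nanosleep(jlong ns) {
  assert(ns > -1, "expected positive time");
  jlong start = os::javaTimeNanos();
  do {
    if (SwitchToThread() == 0) {
      // Nothing else is ready to run on this cpu, so spin a little
      // instead of giving up the rest of the timeslice.
      SpinPause();
    }
  } while (os::javaTimeNanos() - start < ns);
}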
Still passes tier1, with the same performance improvements on Windows (and no obvious regressions on Linux).
New webrev:
https://cr.openjdk.java.net/~rwestberg/8220795/webrev.02/
Best regards,
Robin
>
> Thanks,
> David
> -----
>
>> Best regards,
>> Robin
>>>
>>> ?
>>>
>>> Thanks,
>>> David
>>> -----
>>>
>>>
>>>> Best regards,
>>>> Robin
>>>>> On 5 Apr 2019, at 13:54, Robin Westberg <robin.westberg at oracle.com> wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>>> On 5 Apr 2019, at 12:10, David Holmes <david.holmes at oracle.com> wrote:
>>>>>>
>>>>>> On 5/04/2019 7:53 pm, Robin Westberg wrote:
>>>>>>> Hi David,
>>>>>>> Thanks for taking a look!
>>>>>>>> On 5 Apr 2019, at 10:49, David Holmes <david.holmes at oracle.com> wrote:
>>>>>>>>
>>>>>>>> Hi Robin,
>>>>>>>>
>>>>>>>> On 5/04/2019 6:05 pm, Robin Westberg wrote:
>>>>>>>>> Hi all,
>>>>>>>>> Please review the following change that modifies the way safepointing waits for other threads to stop. As part of JDK-8203469, os::naked_short_nanosleep was used for all platforms. However, on Windows, this function has millisecond granularity which is too coarse and caused performance regressions. Instead, we can take advantage of the fact that Windows can tell us if anyone else is waiting to run on the current cpu. For other platforms the original waiting is used.
>>>>>>>>
>>>>>>>> Can't you just make the new code the implementation of os::naked_short_nanosleep on Windows and avoid adding yet another sleep/yield abstraction? If Windows nanosleep only has millisecond granularity then it's broken.
>>>>>>> Right, I considered that approach, but it's not obvious to me what semantics would be appropriate for that. Depending on whether you want an accurate sleep or want other threads to make progress, it could conceivably be either implemented as pure spinning, or spinning combined with yielding (either cpu-local or Sleep(0)-style).
>>>>>>> That said, perhaps os::naked_short_nanosleep should be removed - the only remaining usage is in SpinYield, which turns to sleeping after giving up on yielding. For Windows, it may be more appropriate to switch to Sleep(0) in that case.
>>>>>>
>>>>>> I definitely don't think we need to introduce another abstraction (though I'm okay with replacing one with a new one).
>>>>>
>>>>> Okay, then I’d prefer to drop the os::naked_short_nanosleep. As I see it, you would use SpinYield when waiting for something like a CAS race, and TimedYield when waiting for a thread rendezvous.
>>>>>
>>>>> I did try to make TimedYield into a different “policy” for the existing SpinYield class at first, but didn’t quite feel like I found a nice API for it. So perhaps it's fine to just have them as separate classes.
>>>>>
>>>>>> It was probably discussed at the time but the naked_short_nanosleep on Windows definitely seems to have far too much overhead - even having to create a new waitable-timer each time. Do we know if the overhead is actually the resolution of the timer or whether it's all the extra work that method needs to do? I should try to dig up the email on that. :)
>>>>>
>>>>> Yes, as far as I know the best possible sleep resolution you can obtain on Windows is one millisecond (perhaps 0.5 ms in theory if you go directly for NtSetTimerResolution). The SetWaitableTimer approach is the best way to actually obtain 1 ms resolution, though - plain Sleep(1) usually sleeps around 2 ms.
>>>>>
>>>>> Best regards,
>>>>> Robin
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> David
>>>>>>
>>>>>>> Best regards,
>>>>>>> Robin
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> David
>>>>>>>> -----
>>>>>>>>
>>>>>>>>> Various benchmarks (specjbb2015, specjvm2008) show better numbers for Windows and no regressions for Linux.
>>>>>>>>> Issue: https://bugs.openjdk.java.net/browse/JDK-8220795
>>>>>>>>> Webrev: http://cr.openjdk.java.net/~rwestberg/8220795/webrev.00/
>>>>>>>>> Testing: tier1
>>>>>>>>> Best regards,
>>>>>>>>> Robin