Spin Loop Hint support: Draft JEP proposal

Wed Oct 7 15:45:44 UTC 2015

Sent from Gil's iPhone

> On Oct 7, 2015, at 1:14 AM, Andrew Haley <aph at redhat.com> wrote:
> 
>> On 05/10/15 21:43, Gil Tene wrote:
>> 
>> I see SpinLoopHint as very separate from things like MONITOR/WAIT
>> (on x86) and WFE/SEV (on ARM), as well as any other "wait in a nice
>> way until this state changes" instructions that other architectures
>> may have or add.
>> 
>> Mechanisms like MONITOR/WAIT and WFE/SEV provide a way to
>> potentially wait for specific state changes to occur. As such, they
>> can be used to implement a specific form of a spin loop (the most
>> common one, probably). But they do not provide for generic spinning
>> forms. E.g. loops that have multiple exit conditions in different
>> memory locations, loops that wait on internal state changes that are
>> not affected by other CPUs (like "spin only this many times before
>> giving up" or "spin for this much time"), and loops that may use
>> transactional state changes (e.g. LOCK XADD, or wider things with
>> TSX) are probably "hard" to model with these instructions.
> 
> Yes, you're right: there's no real way to combine these things, and
> support for WFE requires some other kind of interface -- if I ever
> manage to think of a nice way to express it in Java.  So, my
> apologies for hijacking this thread, but now you've got me thinking.
> 
> In an ideal world there would be a timer associated with WFE which
> would trigger after a short while and allow a thread to be
> descheduled.  However, it is possible to set a periodic timer which
> regularly signals each worker thread, giving it the opportunity to
> block if unused for a long time.  This should make a much more
> responsive thread pool, so that when worker threads are active they
> respond in nanoseconds rather than microseconds.

The problem with using timer-based interrupts to kick out of WFE or MWAIT situations is that the granularity needed is often too fine for timers (and interrupts). E.g. j.u.c often uses a "magic number of spins" of 64 or so before backing out of the spin. That's just too short a window to get a timer going (and canceled) in, and the overhead of interrupt handling will overwhelm the actual action being attempted.
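
To make the "spin a bounded number of times, then back out" shape concrete, here is a rough sketch. It assumes the hint ends up exposed as Thread.onSpinWait() (the form it takes in JDK 9 and later); the class name, method name, spin budget and park interval are all made up for illustration and are not taken from j.u.c.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

final class BoundedSpin {
    // Hypothetical spin budget, in the "64 or so" range mentioned above.
    static final int MAX_SPINS = 64;

    /**
     * Waits until the flag becomes true.
     * Returns true if the flag was observed during the spin phase,
     * false if the caller had to fall back to parking.
     */
    static boolean awaitFlag(AtomicBoolean flag) {
        // Phase 1: short spin. The loop has two exit conditions, one of which
        // (the counter) is purely thread-local state.
        for (int i = 0; i < MAX_SPINS; i++) {
            if (flag.get()) {
                return true;
            }
            Thread.onSpinWait(); // spin loop hint (assumed API, JDK 9+)
        }
        // Phase 2: give up spinning and back off with short timed parks
        // (the 1 microsecond interval is an arbitrary illustrative value).
        while (!flag.get()) {
            LockSupport.parkNanos(1_000L);
        }
        return false;
    }
}
```

Note that the spin phase exits on a local counter as well as on shared memory, which is exactly the kind of loop a "wait for this address to change" instruction does not cover by itself.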

"What we really need" for WFE/MWAIT hardware instructions to be useful in this space (of spin-for-a-bit-before-giving-up) is for the instructions to take a timeout argument (e.g. # of clock cycles; a power of two would probably suffice, so not a lot of bits would be needed). But that's just not how they work on current HW...
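
As a back-of-the-envelope illustration of why a power-of-two timeout needs very few bits: only the exponent has to be encoded, so a 6-bit field would cover any cycle count a 64-bit counter could hold. The sketch below is purely hypothetical arithmetic (no current instruction takes such an argument, and the class and method names are invented here).

```java
final class SpinTimeoutEncoding {
    // Largest exponent kept representable as a positive long (2^62 cycles).
    static final int MAX_EXPONENT = 62;

    /** Smallest exponent e such that 2^e >= cycles. */
    static int exponentFor(long cycles) {
        if (cycles < 1 || cycles > (1L << MAX_EXPONENT)) {
            throw new IllegalArgumentException("cycles out of range: " + cycles);
        }
        // For cycles == 1 this is 64 - 64 = 0; otherwise it is the bit length of (cycles - 1).
        return 64 - Long.numberOfLeadingZeros(cycles - 1);
    }

    /** Number of cycles represented by an encoded exponent. */
    static long cyclesFor(int exponent) {
        if (exponent < 0 || exponent > MAX_EXPONENT) {
            throw new IllegalArgumentException("exponent out of range: " + exponent);
        }
        return 1L << exponent;
    }
}
```

For example, a requested budget of 1000 cycles rounds up to an exponent of 10, i.e. 1024 cycles.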

> 
> [ An aside: WFE is available in user mode, and according to Intel's
> documentation it should be possible to configure an OS to use
> MONITOR/WAIT in user mode too.  I don't know why it doesn't work. ]

While some documentation ambiguously suggests that MONITOR/MWAIT may be available at CPLs above 0, the current documentation for the actual MWAIT instruction is pretty clear about it only working at privilege level 0. So maybe this will be relaxed in the future?

In any case, even if it were user-mode-accessible, MWAIT may not be appropriate for latency-sensitive spinning, because it can apparently take 1000s of cycles to come out of the C-state modes it goes into. At those levels, you may be better off blocking or yielding, and no one who cares about quick reaction time would make use of it. It may be that it's not that bad, depending on the C-state requested, but the fact that Linux kernels don't currently use MWAIT for spin loops suggests that it is not good for that use case yet.

For ARM, I expect WFE/SEV to need to evolve as well, and for other reasons, even for use within OSs. The current WFE/SEV scheme is not scalable. While it probably works ok for spinning at the kernel level on hardware that only has a handful of cores, the fact that the event WFE waits for (and SEV sends) is global to the system will break things as core counts grow (it is the hardware equivalent of wait/notifyAll() with a single global monitor). I expect that for OSs to use it for spinning on many-core systems, there would need to be some de-muxing capability added (e.g. by address or by some id). In its current form, it is probably not ready for exposure to user mode code. (Imagine what would happen if user code started doing system-wide SEVs on every unlock).
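
The wait/notifyAll() analogy can be sketched in Java. This is only a model of the scaling problem, not ARM code: every lock shares one monitor, wait() stands in for WFE, and notifyAll() stands in for SEV. All names here (GlobalEventLock, wastedWakeups, demo) are invented for the illustration.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.LockSupport;

final class GlobalEventLock {
    // One monitor shared by every lock: stands in for the system-wide event.
    private static final Object GLOBAL_EVENT = new Object();
    // Counts wakeups after which a waiter found its own lock still held.
    static final AtomicLong wastedWakeups = new AtomicLong();

    private boolean held; // guarded by GLOBAL_EVENT

    void lock() {
        synchronized (GLOBAL_EVENT) {
            while (held) {
                try {
                    GLOBAL_EVENT.wait(); // plays the role of WFE
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
                if (held) {
                    wastedWakeups.incrementAndGet(); // woken for someone else's unlock
                }
            }
            held = true;
        }
    }

    void unlock() {
        synchronized (GLOBAL_EVENT) {
            held = false;
            GLOBAL_EVENT.notifyAll(); // plays the role of SEV: wakes every waiter
        }
    }

    /** Cycles an unrelated lock once while a thread waits on another lock; returns the wasted wakeups seen. */
    static long demo() {
        GlobalEventLock a = new GlobalEventLock();
        GlobalEventLock b = new GlobalEventLock();
        long before = wastedWakeups.get();
        a.lock();
        Thread waiter = new Thread(() -> { a.lock(); a.unlock(); });
        waiter.start();
        // Wait until the waiter is parked inside GLOBAL_EVENT.wait().
        while (waiter.getState() != Thread.State.WAITING) {
            LockSupport.parkNanos(100_000L);
        }
        b.lock();
        b.unlock(); // unrelated unlock, yet the waiter on 'a' is woken
        while (wastedWakeups.get() == before) {
            LockSupport.parkNanos(100_000L);
        }
        a.unlock();
        while (waiter.isAlive()) {
            LockSupport.parkNanos(100_000L);
        }
        return wastedWakeups.get() - before;
    }
}
```

Calling GlobalEventLock.demo() reports at least one wasted wakeup from a single unrelated unlock. With many waiters and many unrelated locks, every unlock wakes all of them, which is the de-muxing problem described above.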