Spin Loop Hint support: Draft JEP proposal

Fri Oct 9 06:56:02 UTC 2015

> On Oct 8, 2015, at 6:18 PM, John Rose <john.r.rose at oracle.com> wrote:
> 
> On Oct 8, 2015, at 12:39 AM, Gil Tene <gil at azul.com> wrote:
>> 
>> On the one hand:
>> 
>> I like the idea of (an optional?) boolean parameter as a means of hinting at the thing that may terminate the spin. It's probably much more general than identifying a specific field or address. And it can be used to cover cases that poll multiple addresses (an OR in the boolean) or look at a termination time. If the JVM can track down the boolean's evaluation to dependencies on specific memory state changes, it could pass it on to hardware, if such hardware exists.
> 
> Yep.  And there is a user-mode MWAIT in SPARC M7, today.

Cool. Didn't know that. So now it's SPARC M7 and ARM v8. Both are fairly new, but the pattern of monitoring a single address (or range) and waiting on a potential change to it seems common (and similar to the kernel-mode MONITOR/MWAIT in x86). Anything similar coming (or already here) in Power or MIPS?

> For Intel, Dave Dice wrote this up:
>  https://blogs.oracle.com/dave/entry/monitor_mwait_for_spin_loops

Cool writeup. But with the current need to transition to kernel mode, this may work for loops that want to idle and save power and are willing to sacrifice reaction time to do so. That is the opposite of what a spinLoopHint() would typically be looking to do. On modern x86, for example, adding a PAUSE instruction improves the reaction speed of the spin loop (see charts attached to JEP), but adding the trapping cost and protection mode transition of a system call to do an MWAIT will almost certainly do the opposite.

If/when MONITOR/MWAIT becomes available in user mode, it will join ARM v8 and SPARC M7 in a common useful paradigm.

> Also, from a cross-platform POV, a boolean would provide an easy to use "hook" for profiling how often the polling is failing.  Failure frequency is an important input to the tuning of spin loops, isn't it?  Why not feed that info through to the JVM?

I don't follow. Perhaps I'm missing something. Spin loops are "strange" in that they tend to not care about how "fast" they spin, but do care about their reaction time to a change in the thing(s) they are spinning on. I don't think profiling will help here…

E.g. in the example tests for this JEP on Ivy Bridge Xeons, adding an intrinsified spinLoopHint() to a simple loop that spins on a volatile value appears to reduce the "spin throughput" by a significant ratio (3x-5x for L1-sharing threads), but also reduces the reaction time by 35-50%.

> ...
>> and if/when it does, I'm not sure the semantics of passing the boolean through are enough to cover the actual way to use such hardware when it becomes available.
> 
> The alternative is to have the JIT pattern-match for loop control around the call to Thread.yield. That is obviously less robust than having the user thread the poll condition bit through the poll primitive.

I don't think that's the alternative. The alternative(s) I suggest require no analysis by the JIT:

The main means of spin loop hinting I am suggesting is a simple no args hint. [Folks seem to be converging on using Thread as the home for this stuff, so I'll use that]:

E.g.:
while (!done) {
	Thread.spinLoopHint();
}
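
A minimal runnable sketch of this first form. The spinLoopHint() name is the one used in this thread; as an assumption for the sake of a runnable example, it is written here as a local helper that delegates to Thread.onSpinWait() (available on Java 9 and later). The class name and the setter thread are illustrative only.

```java
final class SpinLoopHintDemo {
    // Flag the spinning thread polls; volatile so the setter's write becomes visible.
    static volatile boolean done;

    // Hypothetical stand-in for the proposed Thread.spinLoopHint().
    // Assumption: Java 9+, where Thread.onSpinWait() exists.
    static void spinLoopHint() {
        Thread.onSpinWait();
    }

    // Spins until another thread sets 'done'; returns how many times the loop body ran.
    static long spinUntilDone() {
        done = false;
        Thread setter = new Thread(() -> {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done = true; // always set, even if the sleep was interrupted
        });
        setter.start();

        long spins = 0;
        while (!done) {
            spinLoopHint(); // no-args hint: nothing about 'done' is passed along
            spins++;
        }

        try {
            setter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return spins;
    }

    public static void main(String[] args) {
        System.out.println("loop iterations before exit: " + spinUntilDone());
    }
}
```

The hint carries no information about what is being polled, which is why this form needs no JIT analysis.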

The second form I'm suggesting (mostly in reaction to discussion on this thread) directly captures the notion that a single address is being monitored:

E.g. 

volatile boolean done;
static final Field doneField = …;
...
Thread.spinExecuteWhileTrue( () -> !done, doneField, this ); // ugly method name I'm not married to...

or a slightly more complicated: 

Thread.spinExecuteWhileTrue( () -> { count++; return !done;} , doneField, this ); 

[These Thread.spinExecuteWhileTrue() examples will execute the BooleanSupplier each time, but will only watch the specified field for changes in the spin loop, potentially pausing the loop until a change in the field is detected, but will not pause indefinitely. This can be implemented with a MONITOR/MWAIT, WFE/SEVL, or by just using a PAUSE instruction and not watching the field at all.]

(for Java 9, a varhandle variant of the above reflection based model is probably more appropriate. I spelled this with the reflection form for readability by pre-varhandles-speakers).
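
A rough sketch of the simplest possible implementation of that second form: the PAUSE-only fallback that executes the BooleanSupplier each time and does not watch the field at all. Everything here is hypothetical: spinExecuteWhileTrue() is written as a static helper in an illustrative class (not in Thread), and Thread.onSpinWait() (Java 9+) is assumed as the stand-in for the spin hint.

```java
import java.lang.reflect.Field;
import java.util.function.BooleanSupplier;

final class SpinExecuteSketch {
    volatile boolean done;
    int count; // only touched by the spinning thread

    // Hypothetical helper with the same shape as the Thread.spinExecuteWhileTrue() examples.
    // This fallback ignores watchedField and holder; a smarter implementation could use
    // them to wait for a change to that one field, with a bounded wait.
    static void spinExecuteWhileTrue(BooleanSupplier condition, Field watchedField, Object holder) {
        while (condition.getAsBoolean()) {
            Thread.onSpinWait();
        }
    }

    // Runs the "slightly more complicated" example and returns how often the supplier ran.
    static int demo() {
        SpinExecuteSketch s = new SpinExecuteSketch();
        Field doneField;
        try {
            doneField = SpinExecuteSketch.class.getDeclaredField("done");
        } catch (NoSuchFieldException e) {
            throw new AssertionError(e);
        }

        Thread setter = new Thread(() -> s.done = true);
        setter.start();

        spinExecuteWhileTrue(() -> { s.count++; return !s.done; }, doneField, s);

        try {
            setter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return s.count;
    }

    public static void main(String[] args) {
        System.out.println("supplier evaluations: " + demo());
    }
}
```

Because the supplier is always evaluated at least once, the count is at least 1 no matter how quickly the field changes.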

Neither of these forms requires any specific JIT matching or exploration. We know the first form is fairly robust on architectures that support stuff like PAUSE. The second form will probably be robust both on architectures that support MWAIT or WFE and on those that support PAUSE (those just won't watch anything).

On how this differs from a single boolean parameter: My notion (in the example above) of a single poll variable would be one that specifically designates the poll variable as a field (or maybe array index as an option), rather than provide a boolean parameter that is potentially evaluated based on data read from more than one memory location.

The issue is that while it's an easy fit if the boolean is computed based on evaluating a single address, it becomes fragile if multiple addresses are involved and the hardware can only watch one (which is the current trend for ARM v8, SPARC M7, and a potential user-mode MONITOR/MWAIT on x86). It would be "hard" for a JIT to figure out which of the addresses read to compute the boolean should be watched in the spin. And getting it wrong can have potentially surprising consequences (not just lack of benefit, but terribly slow execution due to waiting for something that is not going to be externally modified and timing out each time before spinning).

E.g. this probably looks good to a programmer:

while (!pollSpinExit(done1 || done2 || (count++ > limit))) {
}

And it could translate to the following rough mixed pseudo code:

	SEVL
loop:
	WFE
	ldaxrh %done1, [done1]
	if (!(%done1 || done2 || (count++ > limit))) goto loop
	…

But it could also be translated to:

	SEVL
loop:
	WFE
	ldaxrh %done2, [done2]
	if (!(done1 || %done2 || (count++ > limit))) goto loop
	…

(or a third option that decides to watch count instead).

None of these are "right". And there is nothing in the semantics that suggests which one to expect.

You could fall back and say that you would only get the benefit if there is exactly one address used in deriving the boolean, but this would probably make it hard to code to and maintain. A form that forces you to specify the polling parameter would be less generic in expression, but will be less fragile to program to as well, IMO.
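
To make the contrast concrete, here is a sketch of the same three-way exit condition written in the explicit watched-field style, using the VarHandle spelling mentioned earlier. The helper, the class, and its signature are all hypothetical, and the body is only the PAUSE-style fallback (assuming Java 9+ for VarHandle and Thread.onSpinWait()); the point is just that the caller, not the JIT, names the one location to watch.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.util.function.BooleanSupplier;

final class WatchedFieldSketch {
    volatile boolean done1;
    volatile boolean done2;
    int count; // only touched by the spinning thread

    // Handle for the single field the caller chooses to have watched.
    static final VarHandle DONE1;
    static {
        try {
            DONE1 = MethodHandles.lookup()
                    .findVarHandle(WatchedFieldSketch.class, "done1", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Hypothetical helper: the condition may read anything, but exactly one
    // location is designated as "watched". This fallback ignores the handle.
    static void spinExecuteWhileTrue(BooleanSupplier condition, VarHandle watched, Object holder) {
        while (condition.getAsBoolean()) {
            Thread.onSpinWait();
        }
    }

    // Nobody sets done1 or done2 here, so the loop exits via the count limit.
    static int spinWithLimit(int limit) {
        WatchedFieldSketch s = new WatchedFieldSketch();
        spinExecuteWhileTrue(
                () -> !(s.done1 || s.done2 || (s.count++ > limit)),
                DONE1, s);
        return s.count;
    }

    public static void main(String[] args) {
        System.out.println("count at exit: " + spinWithLimit(5)); // prints 7
    }
}
```

With a real watching implementation, progress on done2 and count would rely on the bounded wait timing out, which is visible from the call site because done1 is named explicitly.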

> 
>> It is probably premature to design a generic way to provide addresses and/or state to this "spin until something interesting changes" stuff without looking at working examples. A single watched address API is much more likely to fit current implementations without being fragile.
>> 
>> ARM v8's WFE is probably the most real user-mode-accessible thing for this right now (MWAIT isn't real yet, as it's not accessible from user mode). We can look at an example of how a spin loop needs to coordinate the use of WFE, SEVL, and the evaluation of a memory location with load-exclusive operations here: http://lxr.free-electrons.com/source/arch/arm64/include/asm/spinlock.h . The tricky part is that the SEVL needs to immediately precede the loop (and all accesses that need to be watched by the WFE), but can't be part of the loop (if it were in the loop, the WFE would always trigger immediately). But the code in the spinning loop can only track a single address (the exclusive tag in load exclusive applies only to the most recent address used), so it would be wrong to allow generic code in the spin (it would have to be code that watches exactly one address).
>> 
>> My suspicion is that the "right" way to capture the various ways a spin loop would need to interact with WFE logic will be different than tracking things that can generically affect the value of a boolean. E.g. the evaluation of the boolean could be based on multiple addresses, and since it's not clear (in the API) that this is a problem, the benefits derived would be fragile.
> 
> Having the JIT explore nearby loop structure for memory references is even more fragile.

Agreed. Which is why I'm not suggesting it.

> If we can agree that (a) there are advantages to profiling the boolean parameter for all platforms, and (b) the single-poll-variable case is likely to be optimizable sooner *with* a parameter than *without*, maybe this is enough to tip the scales towards boolean parameter.

I guess that's where we differ: I don't see a benefit in profiling the spin loop, so we disagree on (a). And hence (b) is not relevant…

Maybe I'm mis-reading what you mean by "profiling" and "optimizing" above?

> The idea would be that programmers would take a little extra thought when using yield(Z)Z, and get paid immediately from good profiling.  They would get paid again later if and when platforms analyze data dependencies on the Z.
> 
> If there's no initial payoff, then, yes, it is hard asking programmers to expend extra thought that only benefits on some platforms.

Whatever the choices end up being, we could provide multiple signatures or APIs. E.g. I think that the no-args spinLoopHint() is the de-facto spinning model for x86 and Power (and has been for over a decade for everything outside of Java). So it's a safe bet and a natural form. The spin-execute-something-while-watching-a-single-address model is *probably* a good fit for some relatively young but very useful hardware capabilities, and can probably be captured in a long-lasting API as well.

More complicated boolean-derived-from-pretty-much-anything or multi-address watching schemes are IMO too early to evaluate. E.g. they could potentially leverage some just-around-the-corner (or recently arrived) features like TSX and NCAS schemes, but since there is relatively little experience with using such things for spinning (outside of Java), it is probably premature to solidify a Java API for them.

BTW, even with user-mode MWAIT and cousins, and with the watch-a-single-address API forms, we may be looking at two separate motivations, and may want to consider a hint of which one is intended. E.g. one of spinLoopHint()'s main drivers is latency improvement, and the other is power reduction (with potential speed benefits or just power savings benefits). It appears that on x86 a PAUSE provides both, so there is no choice needed there. But MWAIT may be much more of a power-centric approach that sacrifices latency, and that may be OK for some and un-OK for others. We may want to have API variants that allow a hint about whether power reduction or latency reduction is the preferred driver.
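
One possible shape for such a variant, purely as a sketch: the SpinIntent enum, the spinWhileTrue() helper, and the class are all hypothetical names, Thread.onSpinWait() (Java 9+) stands in for a PAUSE-style hint, and LockSupport.parkNanos() is used only as a crude placeholder for a bounded, power-oriented wait (it is not MWAIT and involves the OS scheduler).

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

final class SpinIntentSketch {
    // Hypothetical hint: which benefit the caller cares about most.
    enum SpinIntent { LATENCY, POWER }

    static volatile boolean done;

    // Hypothetical API shape; nothing like this is proposed as-is in the JEP.
    static void spinWhileTrue(BooleanSupplier condition, SpinIntent intent) {
        while (condition.getAsBoolean()) {
            if (intent == SpinIntent.LATENCY) {
                Thread.onSpinWait();          // stay on the CPU, react quickly
            } else {
                LockSupport.parkNanos(1_000); // placeholder for a bounded, power-saving wait
            }
        }
    }

    // Spins with the given intent until another thread sets 'done'.
    static boolean demo(SpinIntent intent) {
        done = false;
        Thread setter = new Thread(() -> done = true);
        setter.start();
        spinWhileTrue(() -> !done, intent);
        try {
            setter.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done;
    }

    public static void main(String[] args) {
        System.out.println("latency-oriented spin finished: " + demo(SpinIntent.LATENCY));
        System.out.println("power-oriented spin finished: " + demo(SpinIntent.POWER));
    }
}
```

On a platform where one instruction serves both goals (as PAUSE appears to on x86), an implementation could simply ignore the intent.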

> 
> — John
>