RFR(XL): 8185640: Thread-local handshakes
Erik Österlund
erik.osterlund at oracle.com
Fri Oct 27 07:11:32 UTC 2017
Hi Andrew,
On 2017-10-26 19:19, Andrew Haley wrote:
> On 26/10/17 18:00, Erik Osterlund wrote:
>> Hi Andrew,
>>
>>> On 26 Oct 2017, at 18:05, Andrew Haley <aph at redhat.com> wrote:
>>>
>>>> On 26/10/17 15:39, Erik Österlund wrote:
>>>>
>>>> The reason we do not poll the page in the interpreter is that we
>>>> need to generate appropriate relocation entries in the code blob
>>>> for the PCs that we poll on, so that in the signal handler we can
>>>> look up the code blob, walk the relocation entries, and find
>>>> precisely why we got the trap, i.e. due to the poll, and precisely
>>>> what kind of poll, so we know which trampoline needs to be taken
>>>> into the runtime.
>>> Not really, no. If we know that we're in the interpreter and the
>>> faulting address is the safepoint poll, then we can read all of the
>>> context we need from the interpreter registers and the frame.
>> That sounds like what I said.
> Not exactly. We do not need to generate any more relocation entries.
Maybe.
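For what it's worth, here is a minimal sketch of the signal-handler
dispatch we are talking about, for Linux/x86_64 and with made-up helper
names (in HotSpot the classification would be a code blob lookup plus a
relocation walk; the stubs below only stand in for that):

  #include <signal.h>
  #include <ucontext.h>

  static void* trap_page;  // the global trapping page, mapped elsewhere

  // Stand-in classifiers: in HotSpot the first would be a code blob
  // lookup plus a walk of its relocation entries, and the second a
  // range check against the interpreter's generated code.
  static bool pc_is_compiled_poll(void* pc)  { (void)pc; return false; }
  static bool pc_is_in_interpreter(void* pc) { (void)pc; return false; }

  static void segv_handler(int sig, siginfo_t* info, void* raw_ctx) {
    (void)sig;
    ucontext_t* ctx = static_cast<ucontext_t*>(raw_ctx);
    void* pc = reinterpret_cast<void*>(ctx->uc_mcontext.gregs[REG_RIP]);

    if (info->si_addr == trap_page) {
      if (pc_is_compiled_poll(pc)) {
        // The relocation entry tells us which kind of poll trapped,
        // and therefore which trampoline into the runtime to take.
      } else if (pc_is_in_interpreter(pc)) {
        // Your point: here the interpreter registers and frame would
        // already carry all the context we need.
      }
      return;  // a real handler would redirect the PC before resuming
    }
    // Not a poll fault: fall through to normal crash handling.
  }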
>> But the cost of the conditional branch is empirically (this was
>> attempted and measured a while ago) approximately the same as the
>> indirect load during "normal circumstances". The indirect load was
>> only marginally better.
> That's interesting. The cost of the SEGV trap going through the
> kernel is fairly high, and I'm now wondering if, for very fast
> safepoint responses, we'd be better off not doing it. The cost of the
> write protect, given that it probably involves an IPI on all
> processors, isn't cheap either.
The current mechanism does not use mprotect to stop threads. It has one
global trapping page and one global non-trapping page, and it simply
performs stores that flip each thread's polling word to point at the
trapping page. So I am not so concerned about TLB shootdown costs here.
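Concretely, a minimal sketch of that arming scheme, with made-up names
(the actual patch keeps the polling word in the thread structure):

  #include <atomic>
  #include <sys/mman.h>

  // Two global pages, mapped once at startup and never re-protected:
  static void* good_page;  // PROT_READ: loads through it are harmless
  static void* trap_page;  // PROT_NONE: loads through it raise SIGSEGV

  static void init_pages() {
    good_page = mmap(nullptr, 4096, PROT_READ,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    trap_page = mmap(nullptr, 4096, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }

  struct ThreadSketch {
    // The per-thread polling word: it points at whichever page this
    // thread's next poll will load from.
    std::atomic<void*> polling_word{nullptr};
  };

  // Arming is just a store; no mprotect call, hence no TLB shootdown.
  static void arm(ThreadSketch* t)    { t->polling_word.store(trap_page); }
  static void disarm(ThreadSketch* t) { t->polling_word.store(good_page); }

  // The poll as compiled code performs it: an indirect load through
  // the polling word, which faults only while this thread is armed.
  static void poll(ThreadSketch* t) {
    *static_cast<volatile char*>(t->polling_word.load());
  }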
As for the SEGV, the mechanism was stress tested (shooting handshakes on
all threads continuously) to see how expensive the SEGV was, and the
outcome was that it was surprisingly cheap. So we did not pursue making
the slow path faster.
>
>>>> While constructing something that does that is indeed possible, it
>>>> simply did not seem worth the trouble compared to using a branch in
>>>> these paths. The same reasoning applies for the poll performed in
>>>> the native wrapper when waking up from native and transitioning into
>>>> Java. It performs a conditional branch instead of an indirect load to
>>>> avoid signal handler logic for polls that are not performance
>>>> critical.
>>> If we're talking about performance, the existing bytecode interpreter
>>> is exquisitely carefully coded, even going to the extent of having
>>> multiple dispatch tables for safepoint- and non-safepoint cases.
>>> Clearly the original authors weren't thinking that code was not
>>> performance critical or they wouldn't have done what they did. I
>>> suppose, though, that the design we have is from the early days when
>>> people diligently strove to make the interpreter as fast as possible.
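(As an aside, for readers following along: that dual-table trick can be
sketched like this, with made-up names:

  #include <cstdio>

  typedef void (*BytecodeHandler)();

  static void nop_normal() { std::puts("nop"); }
  static void nop_safept() {
    // detour: check for a pending safepoint, then do the normal work
    std::puts("safepoint check, then nop");
  }

  // Two complete tables: a plain one, and one whose entries detour
  // through the safepoint check before each bytecode's normal work.
  static BytecodeHandler normal_table[256];
  static BytecodeHandler safept_table[256];
  static BytecodeHandler* active_table = normal_table;

  static void init_tables() {
    for (int i = 0; i < 256; i++) {
      normal_table[i] = nop_normal;
      safept_table[i] = nop_safept;
    }
  }

  // Reaching a safepoint swaps which table dispatch uses, so the hot
  // dispatch path itself carries no extra branch at all.
  static void notice_safepoint() { active_table = safept_table; }
  static void ignore_safepoint() { active_table = normal_table; }

  static void dispatch(unsigned char bc) { active_table[bc](); }

End of aside.)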
>> On the other hand, branches have become a lot faster in "recent"
>> years, and this one is particularly trivial to predict. Therefore I
>> prefer to base design decisions on empirical measurements. And
>> introducing that complexity for a close-to-insignificant speedup of the
>> interpreter poll does not seem encouraging to me. Do you agree?
> Perhaps. It's interesting that the result falls one way in compiled
> code and the other in interpreted code. If the choice is so very
> finely balanced, though, it sort-of makes sense.
Yeah. I wrote about the decision to use an indirect load instead of a
conditional branch in compiled code in an email to Paul, if you are
interested.
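For completeness, the branch flavor of the poll we use in the
interpreter and the native wrapper looks roughly like this sketch (the
flag and slow-path names are made up):

  #include <atomic>
  #include <cstdio>

  static std::atomic<bool> poll_armed{false};  // hypothetical per-thread flag

  static void handshake_slow_path() {
    std::puts("entering the runtime for the handshake");
  }

  // One conditional branch on the fast path. It is trivially predicted
  // while no handshake is pending, and it needs no signal handler or
  // relocation metadata to locate the slow path.
  static inline void branch_poll() {
    if (poll_armed.load(std::memory_order_acquire)) {
      handshake_slow_path();
    }
  }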
Thanks,
/Erik