RFR: 8210498: nmethod entry barriers

Erik Österlund erik.osterlund at oracle.com
Mon Oct 8 12:54:35 UTC 2018


Hi Andrew,

On 2018-10-08 11:31, Andrew Haley wrote:
> On 10/05/2018 04:07 PM, Erik Österlund wrote:
>> This implementation is for x86_64 and is based on the prototyping
>> work of Rickard Bäckman. So big thanks to Rickard, who is a
>> co-author of this patch.
>> The way it works is that a cmp #offset(r15_thread), simm32_epoch;
>> je #continuation sequence is inserted into the verified entry of
>> nmethods, which conditionally calls a stub trampolining into the VM.
>> When the simm32_epoch is not equal to some TLS-local field, then a
>> slow path is taken to run some barrier inside of the VM.  By
>> changing this TLS-local field in the safepoint, all verified entries
>> will take the slow path into the VM when run, where the barrier is
>> eventually disarmed by patching the simm32_epoch value of the
>> instruction stream to the current globally correct value.
> OK. And because it's only triggered on a safepoint there is no need for
> any memory fences.

Correct.
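To make the arm/disarm protocol above concrete, here is a minimal sketch in C++. The names are purely illustrative (they are not HotSpot's actual identifiers); it only models the logic of the cmp/je check, the safepoint arming, and the patching disarm:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the protocol: the verified entry compares a
// thread-local epoch field against a simm32 immediate patched into the
// nmethod's instruction stream, and takes the slow path into the VM
// only when they differ.
struct JavaThreadSim { int32_t disarmed_epoch = 0; };  // the TLS-local field
struct NMethodSim    { int32_t entry_epoch   = 0; };   // the simm32 immediate

// Mirrors: cmp #offset(r15_thread), simm32_epoch; je #continuation
bool must_take_slow_path(const JavaThreadSim& t, const NMethodSim& nm) {
  return nm.entry_epoch != t.disarmed_epoch;
}

// Arming at a safepoint: bumping the TLS-local field makes every nmethod
// whose immediate still holds the old value trap into the VM on entry.
void arm_all(JavaThreadSim& t) { t.disarmed_epoch++; }

// Disarming from the VM slow path: patch the instruction-stream
// immediate to the current globally correct value.
void disarm(const JavaThreadSim& t, NMethodSim& nm) {
  nm.entry_epoch = t.disarmed_epoch;
}
```

Because the arming write happens at a safepoint, no memory fences are needed in the fast path, exactly as noted above.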

>> For the curious reader thinking that this might be an expensive
>> operation, it is not. This check in the nmethod verified entry is
>> very predictable and cheap. And because of use of inlining, the
>> checks become more infrequent. In my evaluation, it has gone
>> completely under the radar in benchmarking (running this with my
>> current version of concurrent class unloading in ZGC).
> I'm unconvinced. This change makes our entry/exit mechanism, already
> rather heavyweight, even more so. I'm aware of the "below the noise"
> argument but I don't think it's valid. Efficient systems are composed
> of thousands of tiny optimizations, each one of which is too small to
> produce any measurable benefit on its own. But they add up, or rather,
> they multiply.
>
> What's happening here, I suspect, is that branch prediction and
> speculation logic in the x86 can often do the whole
> lookahead/speculate cache line read/speculate branch sequence in
> parallel with whatever else the method needs to do. If so, that's a
> card you get to play, but only once.
>
> Not that this is an argument against the patch: if we need to do it,
> then we have to.

I do understand your concerns. I did not have a good intuition about the 
performance characteristics either. So I will start by answering the 
implicit question of whether we need this or not. We do need it badly. We 
have been considering our options for a few years, and all paths led us 
here. We do not have a single model for even STW class unloading in ZGC 
that does not involve nmethod entry barriers.

Note that this is also an opt-in mechanism that you may choose to use 
or not. I would like to convince you, though, that this may be 
useful for other GCs than ZGC as well. In fact, you may use it for 
optimization purposes, and hopefully buy back more performance than you 
lost. Here are a few possible optimization applications that nmethod 
entry barriers could enable:

1) Removing the nmethod hotness counter maintenance, which causes a full 
system stack scanning operation at every safepoint, in favour of marking 
nmethods as being used in an entry barrier.
2) Patching GC barriers when changing phases. GC barriers, like the 
pre-write barrier of Shenandoah and G1, are only necessary during 
concurrent marking. By patching nmethods lazily, you can remove 
those unnecessary paths from the write barriers. The same argument 
applies to Shenandoah copy-on-write barriers, which we know cannot 
trigger after the relocation phase is done.
3) Other GCs might also want to utilize the concurrent class (and 
nmethod) unloading mechanisms that we introduce using nmethod entry 
barriers.
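As an illustration of application (1), here is a small, purely hypothetical sketch (the types and function names are made up for this example, not taken from HotSpot): the entry barrier slow path marks an nmethod as used, and a sweep then only needs to examine the marks rather than scan every thread stack.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of application (1): rather than scanning all
// thread stacks at each safepoint to maintain hotness counters, the
// entry barrier slow path marks an nmethod as used; a sweep cycle then
// treats nmethods that were never entered since the last arming as cold.
struct NMethodSim {
  bool used_since_arming = false;
  void on_entry_barrier() { used_since_arming = true; }  // slow path hook
};

// At a sweep cycle, collect cold candidates and re-arm all marks.
std::vector<NMethodSim*> collect_cold(std::vector<NMethodSim>& code_cache) {
  std::vector<NMethodSim*> cold;
  for (NMethodSim& nm : code_cache) {
    if (!nm.used_since_arming) cold.push_back(&nm);
    nm.used_since_arming = false;  // re-arm for the next interval
  }
  return cold;
}
```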

Also note that the implementation space of the barrier itself has some 
flexibility. Rickard's first prototype involved patching an unconditional 
branch in over a nop. Since the nop is removed in the frontend, 
it seemed like the most conservative starting point. But since there was 
no measurable difference compared to the conditional branch, the 
conditional branch was more favourable in the end, since it had the 
additional advantage of not requiring a code cache walk at the safepoint. 
But if you have a platform where the trade-off is not as obvious, both 
mechanisms could easily be supported.

Regarding your hypothesis about why this works well, I agree. The very 
predictable nature of the branch, combined with the speculation 
machinery, is probably very beneficial. And I think that since the 
introduction of two branch ports in the reservation stations, this type 
of code is handled even better by the hardware.

If I manage to persuade you to give this a shot on AArch64, I would be 
interested in hearing whether further optimizations are required. In 
theory, you might be able to go even further and do virtual address 
tricks with thread stacks, baking this check into the stack banging, to 
both dodge the extra conditional branch and avoid code cache walks in 
safepoints. But that would have a much higher complexity budget, and 
would certainly require an observable problem to justify a matching 
solution complexity, IMO.

Thanks,
/Erik


More information about the hotspot-dev mailing list