8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Nov 22 13:22:13 UTC 2019
Hi Erik,
>>> That is a good question. Unfortunately, there are a few problems
>>> applying such a strategy:
>>>
>>> 1) We do not want to constrain the alignment such that the
>>> instruction (+ specific offset) sits at e.g. the beginning of a 32
>>> byte boundary. We want to be more loose and say that any alignment is
>>> fine... except the bad ones (crossing and ending at a 32 byte
>>> boundary). Otherwise I fear we will find ourselves bloating the code
>>> cache with unnecessary nops to align instructions that would never
>>> have been a problem. So in terms of alignment constraints, I think
>>> such a hammer is too big.
>>
>> It would be interesting to have some data on that one. Aligning a 5-byte
>> instruction on an 8-byte boundary wastes 3 bytes at most. For a 10-byte
>> sequence it wastes 6 bytes at most, which doesn't sound good.
>
> I think you missed one of my points (#2), which is that aligning single
> instructions is not enough to remedy the problem.
> For example, suppose you have an add; jne sequence, and say we decide on
> a magical alignment that we apply globally to jcc instructions.
> Assume that the jne was indeed aligned correctly and we insert no nops.
> Then it is not necessarily the case that the fused add + jne sequence
> has the desired alignment property as well (i.e. the add might cross a
> 32 byte boundary, tainting the macro-fused micro code). Therefore, this
> will not work.
> I suppose that if you always put in at least one nop before the jcc, to
> break all macro fusion of branches globally, then that would work. But
> that seems like a larger hammer than what we need.
I was thinking about aligning macro-fused instruction sequences and not
individual jcc instructions. There are both automatic and cooperative
ways to detect them.
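
To make the boundary condition concrete, here is a small self-contained
sketch (toy code of mine, not HotSpot and not the webrev): given the start
offset and size of a jcc, or of a fused pair measured as one byte range, it
checks whether the range crosses or ends at a 32 byte boundary and how much
nop padding would push it clear. The last assertions illustrate your
add; jne point: the jne alone can look fine while the fused pair still
straddles a boundary.

#include <cstdint>
#include <cassert>

// Toy sketch, not HotSpot code. A branch -- or a macro-fused
// condition + branch pair measured as one range -- is problematic
// when it crosses or ends at a 32 byte boundary.
static const uint32_t boundary = 32;

// True if the byte range [start, start + size) crosses a 32 byte
// boundary or has its last byte directly in front of one (ends at it).
static bool intersects_boundary(uint32_t start, uint32_t size) {
  uint32_t end = start + size;                          // one past last byte
  return (start / boundary) != ((end - 1) / boundary)   // crosses
      || (end % boundary) == 0;                         // ends at
}

// Nop bytes needed in front of the range so that it no longer crosses
// or ends at a boundary; for any range smaller than 32 bytes, padding
// up to the next 32 byte boundary clears the condition.
static uint32_t required_padding(uint32_t start, uint32_t size) {
  return intersects_boundary(start, size) ? boundary - (start % boundary) : 0;
}

int main() {
  // A 6 byte jcc starting at offset 29 crosses the boundary at 32,
  // and 3 bytes of padding move it to start exactly at 32:
  assert(intersects_boundary(29, 6));
  assert(required_padding(29, 6) == 3);

  // Macro fusion: a 3 byte add at offset 30 followed by a 2 byte jne
  // at offset 33. The jne alone looks fine ...
  assert(!intersects_boundary(33, 2));
  // ... but the fused add + jne range (offset 30, 5 bytes) straddles
  // the boundary, which is why aligning single jcc nodes is not enough.
  assert(intersects_boundary(30, 5));
  return 0;
}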
>>> 2) Another issue is that the alignment constraints do not apply just to
>>> a single Mach node; sometimes they apply to a fused op + jcc. We
>>> currently match the conditions and their branches separately (with the
>>> conditions not necessarily knowing they are indeed conditions to a
>>> branch, like for example an and instruction). So aligning the jcc, for
>>> example, is not necessarily going to help, unless its alignment knows
>>> what its preceding instruction is, and whether it will be fused or
>>> not. And depending on that, we want different alignment properties.
>>> So here the hammer is seemingly too loose.
>>
>> I mentioned MacroAssembler in my previous email because I don't consider
>> this a C2-specific problem. Stubs, the interpreter, and C1 are also
>> affected and we need to fix them too (considering that being on the edge
>> of a cache line may cause unpredictable behavior).
>
> I disagree that this is a correctness fix. The correctness fix is for
> branches on 64 byte boundaries, and is being dealt with
> using micro code updates (which disable micro op caching of the
> problematic branch and fused branch micro ops). What
> we are dealing with here is mitigating the performance hit of Intel's
> correctness mitigation for the erratum, which involves
> branches and fused branches crossing or ending at 32 byte boundaries. In
> other words, the correctness is dealt with elsewhere,
> and we are optimizing the code to avoid regressions for performance
> sensitive branches due to that correctness fix.
> Therefore, I think it is wise to focus the optimization efforts where it
> matters the most: C2.
Point taken. I agree that the focus should be on performance.
But I'd include stubs as well. Many of them are extensively used from
C2-generated code.
> I don't think that doing some analysis on the Mach nodes and injecting
> the padding only where we actually need it is too
> complicated in C2 (which I believe, at least for now, is where we should
> focus).
I'm curious: is there anything special about Mach IR that
helps/simplifies the analysis?
One problem I see with it is that some mach nodes issue local branches
for their own purposes (inside MachNode::emit()), and you can't mitigate
such cases at the level of Mach IR.
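
Purely as an illustration of where such a hook could live (toy types and
names, not the real MacroAssembler/CodeBuffer interface and not your
webrev), a padding check at the assembler level would also cover branches
emitted from inside MachNode::emit(), at the cost of the caller having to
pass in the fused size, since the assembler cannot see fusion on its own:

#include <cstdint>
#include <vector>

// Toy stand-in for an assembler code buffer; the real interfaces differ.
// It only shows at which level an emit-time mitigation would sit.
class ToyAssembler {
  std::vector<uint8_t> _code;
 public:
  uint32_t offset() const   { return (uint32_t)_code.size(); }
  void emit_byte(uint8_t b) { _code.push_back(b); }
  void nop()                { emit_byte(0x90); }

  // Pad so that the next inst_size bytes (a jcc, or a fused cmp/jcc pair
  // measured together) neither cross nor end at a 32 byte boundary --
  // same arithmetic as in the earlier sketch.
  void align_for_jcc_erratum(uint32_t inst_size) {
    const uint32_t boundary = 32;
    uint32_t start = offset();
    uint32_t end   = start + inst_size;
    bool bad = (start / boundary) != ((end - 1) / boundary) ||
               (end % boundary) == 0;
    for (uint32_t pad = bad ? boundary - (start % boundary) : 0; pad > 0; pad--) {
      nop();
    }
  }

  // A branch emitted from a node's own emit routine would call the same
  // hook right before writing its bytes (sizes here are made up).
  void jcc_with_mitigation() {
    align_for_jcc_erratum(6);                     // assume a 6 byte jcc
    for (int i = 0; i < 6; i++) emit_byte(0x0F);  // placeholder encoding
  }
};

int main() {
  ToyAssembler a;
  for (int i = 0; i < 30; i++) a.nop();  // land near a 32 byte boundary
  a.jcc_with_mitigation();               // pads 2 bytes, jcc starts at 32
  return a.offset() == 38 ? 0 : 1;
}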
> I have made a prototype of what this might look like:
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
Just one more comment: it's weird to see intel_jcc_erratum referenced in
shared code. You could #ifdef it to be x86-only, but it's much better to
move the code to an x86-specific location.
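
For illustration only (hypothetical file and function names, nothing from
your patch), the kind of split I mean: shared code calls a neutral hook,
only the x86 translation unit gives it a real body, and every other
platform compiles it to a no-op, so neither the erratum name nor an #ifdef
shows up at the shared call sites:

#include <cstdint>

// Hypothetical shared header (say, branchPadding.hpp): the shared pass
// only ever sees this neutral declaration.
uint32_t branch_padding_needed(uint32_t inst_offset, uint32_t inst_size);

#if defined(__x86_64__) || defined(_M_X64)
// Hypothetical cpu/x86 file (say, intelJccErratum_x86.cpp): holds the
// 32 byte boundary arithmetic from the earlier sketches.
uint32_t branch_padding_needed(uint32_t inst_offset, uint32_t inst_size) {
  const uint32_t boundary = 32;
  uint32_t end = inst_offset + inst_size;
  bool bad = (inst_offset / boundary) != ((end - 1) / boundary) ||
             (end % boundary) == 0;
  return bad ? boundary - (inst_offset % boundary) : 0;
}
#else
// All other platforms: an empty definition keeps shared callers clean.
uint32_t branch_padding_needed(uint32_t, uint32_t) { return 0; }
#endif

int main() {
  // Shared code can call the hook unconditionally.
  return (int)branch_padding_needed(30, 6);  // 2 on x86, 0 elsewhere
}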
Best regards,
Vladimir Ivanov
> Incremental:
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.00_01/
>
> The idea is pretty much what I said to Paul. There are 3 hooks needed:
> 1) Apply pessimistic size measurements during branch shortening on
> affected nodes
> 2) Analyze which nodes will fuse, and tag all affected mach nodes with a
> flag
> 3) When emitting the code, add required padding on the flagged nodes
> that end at or cross 32 byte boundaries.
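
That structure sounds reasonable. To make sure I read hooks (2) and (3)
correctly, here is roughly how I picture them, with toy types standing in
for Mach IR (hypothetical names, nothing from your webrev):

#include <cstdint>
#include <vector>

// Toy model of hooks (2) and (3): hypothetical names, not Mach IR.
struct ToyNode {
  uint32_t size;        // pessimistic encoded size, as in hook (1)
  bool is_branch;       // a jcc-like node
  bool is_condition;    // a cmp/test/add-like node that can fuse
  bool jcc_flagged;     // set by the analysis pass, hook (2)
};

// Hook (2): walk the nodes of a block in order and flag branches plus
// the directly preceding condition they would fuse with.
static void flag_fusable(std::vector<ToyNode>& block) {
  for (size_t i = 0; i < block.size(); i++) {
    if (!block[i].is_branch) continue;
    block[i].jcc_flagged = true;
    if (i > 0 && block[i - 1].is_condition) {
      block[i - 1].jcc_flagged = true;   // tag the fused partner too
    }
  }
}

// Hook (3): at emit time, measure a run of consecutively flagged nodes
// as one range and pad (same boundary test as in the first sketch) when
// it crosses or ends at a 32 byte boundary.
static uint32_t emit_with_padding(const std::vector<ToyNode>& block) {
  const uint32_t boundary = 32;
  uint32_t offset = 0;
  size_t i = 0;
  while (i < block.size()) {
    if (!block[i].jcc_flagged) {
      offset += block[i++].size;         // "emit" an unaffected node
      continue;
    }
    size_t j = i;
    uint32_t range = 0;
    while (j < block.size() && block[j].jcc_flagged) {
      range += block[j++].size;          // whole flagged run as one range
    }
    uint32_t end = offset + range;
    if ((offset / boundary) != ((end - 1) / boundary) || end % boundary == 0) {
      offset += boundary - (offset % boundary);   // insert nop bytes
    }
    offset += range;                     // "emit" the flagged run
    i = j;
  }
  return offset;
}

int main() {
  // 27 bytes of other code, then a 3 byte add fusing with a 2 byte jne.
  std::vector<ToyNode> block = {
    {27, false, false, false},
    {3,  false, true,  false},   // add
    {2,  true,  false, false},   // jne
  };
  flag_fusable(block);
  // The fused pair would end exactly at offset 32, so 5 nop bytes are
  // inserted and the final offset is 27 + 5 + 3 + 2 = 37.
  return emit_with_padding(block) == 37 ? 0 : 1;
}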
>
> I haven't run exhaustive tests or measurements yet on this. I thought we
> should sync ideas so we agree about direction before I do too much.
>
> What do you think?
>
> Thanks,
> /Erik
>
>> Best regards,
>> Vladimir Ivanov
>>
>>> I'm not 100% sure what to suggest for the generic case, but perhaps:
>>>
>>> After things have stopped moving around, add a pass over the Mach
>>> nodes, similar to branch shortening, that:
>>>
>>> 1) Set up a new flag (Flags_intel_jcc_mitigation or something) to be
>>> used on Mach nodes to mark affected nodes.
>>> 2) Walk the Mach nodes and tag branches and conditions used by fused
>>> branches (by walking edges), checking that the two are adjacent (by
>>> looking at the node index in the block), and possibly also checking
>>> that it is one of the affected condition instructions that will get
>>> fused.
>>> 3) Now that we know which Mach nodes (and sequences of macro fused
>>> nodes) are problematic, we can put some code where the mach nodes are
>>> emitted that checks for consecutively tagged nodes and injects nops in
>>> the code buffer if they cross or end at 32 byte boundaries.
>>>
>>> I suppose an alternative strategy is making sure that any problematic
>>> instruction sequence that would be fused is also fused into one Mach
>>> node, by sprinkling more rules in the AD file for the various forms of
>>> conditional branches that we think cover all the cases, and then
>>> applying the alignment constraint on individual nodes only. But it
>>> feels like that could be more intrusive and less efficient.
>>>
>>> Since the generic problem is more involved compared to the simpler
>>> ZGC load barrier fix (which will need special treatment anyway), I
>>> would like to focus this RFE only on the ZGC load barrier branch,
>>> because it makes me sad when it has to suffer. Having said that, we
>>> will certainly look into fixing the generic problem too after this.
>>
>