8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier

erik.osterlund at oracle.com
Thu Nov 21 15:51:05 UTC 2019


Hi Vladimir,

Thank you for reviewing this patch.

On 11/21/19 12:12 PM, Vladimir Ivanov wrote:
> (Missed Paul's and your response when sending previous email.)
>
>> That is a good question. Unfortunately, there are a few problems 
>> applying such a strategy:
>>
>> 1) We do not want to constrain the alignment such that the 
>> instruction (+ specific offset) sits at e.g. the beginning of a 32 
>> byte boundary. We want to be more loose and say that any alignment is 
>> fine... except the bad ones (crossing and ending at a 32 byte 
>> boundary). Otherwise I fear we will find ourselves bloating the code 
>> cache with unnecessary nops to align instructions that would never 
>> have been a problem. So in terms of alignment constraints, I think 
>> such a hammer is too big.
>
> It would be interesting to have some data on that one. Aligning a
> 5-byte instruction on an 8-byte boundary wastes 3 bytes at most. For a
> 10-byte sequence it wastes 6 bytes at most, which doesn't sound good.

I think you missed one of my points (#2), which is that aligning single
instructions is not enough to remedy the problem.
For example, suppose you have an add; jne; sequence, and that we decide
on a magical alignment that we apply globally to jcc instructions.
Assume that the jne is indeed aligned correctly and we insert no nops.
Then it is not necessarily the case that the fused add + jne sequence
has the desired alignment property as well (i.e. the add might cross a
32 byte boundary, tainting the macro-fused micro ops). Therefore, this
will not work.
I suppose that if you always put in at least one nop before the jcc, to
break all macro fusion of branches globally, then you could make that
work. But that seems like a bigger hammer than we need.
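
To make that concrete, here is a small standalone sketch (my own toy
arithmetic, not HotSpot code; the offsets and instruction sizes are made
up) showing how a jne can look fine on its own while the macro-fused
add + jne pair still crosses a 32 byte boundary:

#include <cstdint>
#include <cstdio>

// True if [offset, offset + size) crosses a 32 byte boundary or has its
// last byte right before one, i.e. the pattern the mitigation penalizes.
static bool crosses_or_ends_at_32(uint32_t offset, uint32_t size) {
  uint32_t last = offset + size - 1;        // offset of the last byte
  return (offset / 32) != ((last + 1) / 32);
}

int main() {
  // Hypothetical layout: 3-byte add at offset 29, 2-byte jne at offset 32.
  uint32_t add_off = 29, add_size = 3;
  uint32_t jne_off = 32, jne_size = 2;

  // The jne alone starts on the boundary and is unaffected...
  printf("jne alone:  %d\n", crosses_or_ends_at_32(jne_off, jne_size));            // 0
  // ...but the fused add + jne pair spans offsets 29..33 and crosses it.
  printf("fused pair: %d\n", crosses_or_ends_at_32(add_off, add_size + jne_size)); // 1
  return 0;
}

So a per-instruction alignment rule applied to the jcc alone misses this
case; the check has to be applied to the whole fused pair.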

>
>> 2) Another issue is that the alignment constraints apply not just to
>> one Mach node; sometimes they apply to a fused op + jcc. We currently
>> match the conditions and their branches separately (and the conditions
>> do not necessarily know they are conditions for a branch, like for
>> example an and instruction). So aligning the jcc, for example, is not
>> necessarily going to help, unless its alignment knows what its
>> preceding instruction is, and whether it will be fused or not. And
>> depending on that, we want different alignment properties. So here the
>> hammer is seemingly too loose.
>
> I mentioned MacroAssembler in the previous email, because I don't
> consider it a C2-specific problem. Stubs, the interpreter, and C1 are
> also affected and we need to fix them too (considering that being on
> the edge of a cache line may cause unpredictable behavior).

I disagree that this is a correctness fix. The correctness fix is for
branches crossing 64 byte boundaries, and is being dealt with using
micro code updates (which disable micro op caching of the problematic
branch and fused-branch micro ops). What we are dealing with here is
mitigating the performance hit of Intel's correctness mitigation for
the erratum, which affects branches and fused branches crossing or
ending at 32 byte boundaries. In other words, correctness is dealt with
elsewhere, and we are optimizing the code to avoid regressions, due to
that correctness fix, for performance sensitive branches.
Therefore, I think it is wise to focus the optimization efforts where
they matter the most: C2.

> Detecting instruction sequences is harder than aligning a single one, 
> but still possible. And MacroAssembler can introduce a new "macro" 
> instruction for conditional jumps which solves the detection problem 
> once the code base migrates to it.

Maybe. As I said, alignment is not enough, because it does not catch
problematic macro fusion. We would have to do something more drastic,
like intentionally breaking all macro fusion by putting leading nops
before branches regardless of alignment. I'm not sure what I think
about that, given that we do this to optimize the code (and not as a
correctness fix).

I don't think that doing some analysis on the Mach nodes and injecting 
the padding only where we actually need it is too
complicated in C2 (which I believe, at least for now, is where we should 
focus).

I have made a prototype of what this might look like:
http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/

Incremental:
http://cr.openjdk.java.net/~eosterlund/8234160/webrev.00_01/

The idea is pretty much what I said to Paul. There are 3 hooks needed:
1) Apply pessimistic size measurements during branch shortening to
affected nodes
2) Analyze which nodes will fuse, and tag all affected Mach nodes with a
flag
3) When emitting the code, add the required padding to flagged nodes
that would end at or cross a 32 byte boundary (a rough sketch of that
arithmetic follows below).
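
To illustrate what I mean by 1) and 3), here is a standalone sketch of
the arithmetic (again just a toy model, not the webrev code; the names
are mine): how much padding a flagged node or fused pair needs at a
given offset, and the pessimistic bound that branch shortening would
have to account for:

#include <cstdint>
#include <cstdio>

// Nop bytes needed so that 'size' bytes emitted at 'offset' neither
// cross nor end at a 32 byte boundary (assumes size < 32).
static uint32_t required_padding(uint32_t offset, uint32_t size) {
  uint32_t in_window = offset % 32;
  if (in_window + size > 31) {   // would cross or end at the boundary
    return 32 - in_window;       // push the start to the next boundary
  }
  return 0;
}

// Hook 1): pessimistic size during branch shortening. The worst case
// start position forces us to pad 'size' bytes (for size < 32), so the
// estimate is simply doubled.
static uint32_t pessimistic_size(uint32_t size) {
  return 2 * size;
}

int main() {
  // A hypothetical 5-byte fused cmp + jcc:
  printf("padding at offset 27: %u\n", required_padding(27, 5)); // 5 (would end at the boundary)
  printf("padding at offset 26: %u\n", required_padding(26, 5)); // 0 (fits before the boundary)
  printf("pessimistic size:     %u\n", pessimistic_size(5));     // 10
  return 0;
}

In the actual patch the padding would of course be emitted as nops into
the code buffer, and only for nodes tagged by 2), but the boundary
arithmetic is the same.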

I haven't run exhaustive tests or measurements yet on this. I thought we 
should sync ideas so we agree about direction before I do too much.

What do you think?

Thanks,
/Erik

> Best regards,
> Vladimir Ivanov
>
>> I'm not 100% sure what to suggest for the generic case, but perhaps:
>>
>> After things stopped moving around, add a pass to the Mach nodes, 
>> similar to branch shortening that:
>>
>> 1) Set up a new flag (Flags_intel_jcc_mitigation or something) to be 
>> used on Mach nodes to mark affected nodes.
>> 2) Walk the Mach nodes and tag branches and conditions used by fused 
>> branches (by walking edges), checking that the two are adjacent (by 
>> looking at the node index in the block), and possibly also checking 
>> that it is one of the affected condition instructions that will get 
>> fused.
>> 3) Now that we know what Mach nodes (and sequences of macro fused 
>> nodes) are problematic, we can put some code where the Mach nodes are 
>> emitted that checks for consecutively tagged nodes and injects nops in 
>> the code buffer if they cross or end at a 32 byte boundary.
>>
>> I suppose an alternative strategy is making sure that any problematic 
>> instruction sequence that would be fused is also fused into one Mach 
>> node, by sprinkling more rules in the AD file for the various forms of 
>> conditional branches that we think cover all the cases, and then 
>> applying the alignment constraint on individual nodes only. But it 
>> feels like that could be more intrusive and less efficient.
>>
>> Since the generic problem is more involved compared to the simpler 
>> ZGC load barrier fix (which will need special treatment anyway), I 
>> would like to focus this RFE only on the ZGC load barrier branch, 
>> because it makes me sad when it has to suffer. Having said that, we 
>> will certainly look into fixing the generic problem too after this.
>


