8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier

Thu Nov 21 11:12:53 UTC 2019

(I missed Paul's and your responses when sending my previous email.)

> That is a good question. Unfortunately, there are a few problems 
> applying such a strategy:
> 
> 1) We do not want to constrain the alignment such that the instruction 
> (+ specific offset) sits at e.g. the beginning of a 32 byte boundary. We 
> want to be more loose and say that any alignment is fine... except the 
> bad ones (crossing and ending at a 32 byte boundary). Otherwise I fear 
> we will find ourselves bloating the code cache with unnecessary nops to 
> align instructions that would never have been a problem. So in terms of 
> alignment constraints, I think such a hammer is too big.

It would be interesting to have some data on that one. Aligning a 5-byte 
instruction on an 8-byte boundary wastes at most 3 bytes. For a 10-byte 
sequence it wastes at most 6 bytes, which doesn't sound good.
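
As a rough way to get data on the looser rule from point 1 (pad only when 
the instruction crosses or ends at a 32-byte boundary), here is a minimal 
Python sketch. The function names are made up for illustration, and it 
assumes start offsets are uniformly distributed and that the instruction is 
shorter than the boundary.

```python
BOUNDARY = 32  # bytes; the boundary size named in point 1


def is_bad_placement(offset: int, size: int, boundary: int = BOUNDARY) -> bool:
    """True if [offset, offset + size) crosses or ends at a boundary."""
    return (offset % boundary) + size >= boundary


def nops_needed(offset: int, size: int, boundary: int = BOUNDARY) -> int:
    """Padding bytes that move a badly placed instruction to the next boundary."""
    assert 0 < size < boundary, "padding only helps if size < boundary"
    if not is_bad_placement(offset, size, boundary):
        return 0
    return boundary - (offset % boundary)


def average_padding(size: int, boundary: int = BOUNDARY) -> float:
    """Mean padding per instruction, assuming uniformly distributed offsets."""
    return sum(nops_needed(o, size, boundary) for o in range(boundary)) / boundary
```

Under these assumptions a 5-byte instruction is padded at only 5 of the 32 
possible start offsets, so `average_padding(5)` comes out at well under one 
byte per instruction.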

> 2) Another issue is that the alignment constraints apply not just to the 
> one Mach node. Sometimes they apply to a fused op + jcc, since we currently 
> match the conditions and their branches separately (and the conditions 
> do not necessarily know they are indeed conditions to a branch, like for 
> example an and instruction). So aligning the jcc for example is not 
> necessarily going to help, unless its alignment knows what its preceding 
> instruction is, and whether it will be fused or not. And depending on 
> that, we want different alignment properties. So here the hammer is 
> seemingly too loose.

I mentioned MacroAssembler in my previous email because I don't consider 
this a C2-specific problem. Stubs, the interpreter, and C1 are also affected, 
and we need to fix them too (considering that sitting on the edge of a cache 
line may cause unpredictable behavior).

Detecting instruction sequences is harder than aligning a single instruction, 
but still possible. And MacroAssembler can introduce a new "macro" 
instruction for conditional jumps, which solves the detection problem 
once the code base migrates to it.
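
To illustrate the idea of such a "macro" instruction, here is a toy sketch 
in Python. `ToyAssembler` and `cond_jump` are hypothetical names, not the 
HotSpot MacroAssembler API. The point is only that an emitter which sees the 
condition and the jump together can pad the pair as one unit.

```python
class ToyAssembler:
    """Hypothetical stand-in for an assembler; not the HotSpot API."""

    BOUNDARY = 32   # bytes
    NOP = 0x90      # single-byte x86 nop

    def __init__(self) -> None:
        self.code = bytearray()

    def emit(self, encoding: bytes) -> None:
        """Append an already-encoded instruction unchanged."""
        self.code += encoding

    def cond_jump(self, cond: bytes, jcc: bytes) -> int:
        """Emit condition + jcc as one unit, padding with nops if the pair
        would cross or end at a boundary. Returns the pair's start offset."""
        size = len(cond) + len(jcc)
        start = len(self.code) % self.BOUNDARY
        if start + size >= self.BOUNDARY:
            self.code += bytes([self.NOP]) * (self.BOUNDARY - start)
        self.code += cond + jcc
        return len(self.code) - size
```

For example, after 29 bytes of other code, emitting `test eax, eax` 
(`85 C0`) plus a short `je` (`74 00`) through `cond_jump` would insert three 
nops so that the pair starts at offset 32.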

Best regards,
Vladimir Ivanov

> I'm not 100% sure what to suggest for the generic case, but perhaps:
> 
> After things stopped moving around, add a pass to the Mach nodes, 
> similar to branch shortening that:
> 
> 1) Set up a new flag (Flags_intel_jcc_mitigation or something) to be 
> used on Mach nodes to mark affected nodes.
> 2) Walk the Mach nodes and tag branches and conditions used by fused 
> branches (by walking edges), checking that the two are adjacent (by 
> looking at the node index in the block), and possibly also checking that 
> it is one of the affected condition instructions that will get fused.
> 3) Now that we know what Mach nodes (and sequences of macro fused nodes) 
> are problematic, we can put some code where the mach nodes are emitted 
> that checks for consecutively tagged nodes and inject nops in the code 
> buffer if they cross or end at 32 byte boundaries.
> 
> I suppose an alternative strategy is making sure that any problematic 
> instruction sequence that would be fused, is also fused into one Mach 
> node by sprinkling more rules in the AD file for the various forms of 
> conditional branches that we think cover all the cases, and then 
> applying the alignment constraint on individual nodes only. But it feels 
> like that could be more intrusive and less efficient.
> 
> Since the generic problem is more involved compared to the simpler ZGC 
> load barrier fix (which will need special treatment anyway), I would 
> like to focus this RFE only on the ZGC load barrier branch, because it 
> makes me sad when it has to suffer. Having said that, we will certainly 
> look into fixing the generic problem too after this.
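
The three-step pass quoted above could be sketched roughly as follows. This 
is a toy model in Python, not C2 code: `Node`, `tag_nodes` and `emit` are 
hypothetical, the `tagged` field stands in for the proposed 
Flags_intel_jcc_mitigation flag, and node sizes are assumed to be final 
(i.e. the pass runs after things have stopped moving around).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BOUNDARY = 32  # bytes


@dataclass
class Node:                       # hypothetical stand-in for a Mach node
    name: str
    size: int                     # encoded size in bytes
    is_branch: bool = False
    cond: Optional[int] = None    # block index of the condition feeding this branch
    fusible: bool = False         # condition instruction that may macro-fuse
    tagged: bool = False          # stands in for Flags_intel_jcc_mitigation


def tag_nodes(block: List[Node]) -> None:
    """Steps 1 and 2: tag branches, plus adjacent fusible conditions."""
    for i, node in enumerate(block):
        if not node.is_branch:
            continue
        node.tagged = True
        if node.cond == i - 1 and block[i - 1].fusible:
            block[i - 1].tagged = True


def emit(block: List[Node], start: int = 0) -> List[Tuple[int, str]]:
    """Step 3: lay out nodes, padding tagged groups that cross or end at
    a boundary. Returns (offset, name) pairs."""
    out, offset, i = [], start, 0
    while i < len(block):
        j = i
        # A tagged group is an optional tagged condition followed by its branch.
        while (block[i].tagged and not block[j].is_branch
               and j + 1 < len(block) and block[j + 1].tagged):
            j += 1
        size = sum(n.size for n in block[i:j + 1])
        if block[i].tagged and (offset % BOUNDARY) + size >= BOUNDARY:
            pad = BOUNDARY - offset % BOUNDARY
            out.append((offset, f"nop*{pad}"))
            offset += pad
        for n in block[i:j + 1]:
            out.append((offset, n.name))
            offset += n.size
        i = j + 1
    return out
```

Calling `tag_nodes(block)` and then `emit(block)` pads a fused pair as a 
whole, while untagged nodes are emitted without any extra nops.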