8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Nov 22 13:22:13 UTC 2019
Hi Erik,
>>> That is a good question. Unfortunately, there are a few problems
>>> applying such a strategy:
>>>
>>> 1) We do not want to constrain the alignment such that the
>>> instruction (+ specific offset) sits at e.g. the beginning of a 32
>>> byte boundary. We want to be more loose and say that any alignment is
>>> fine... except the bad ones (crossing and ending at a 32 byte
>>> boundary). Otherwise I fear we will find ourselves bloating the code
>>> cache with unnecessary nops to align instructions that would never
>>> have been a problem. So in terms of alignment constraints, I think
>>> such a hammer is too big.
>>
>> It would be interesting to have some data on that one. Aligning a 5-byte
>> instruction on an 8-byte boundary wastes 3 bytes at most. For a 10-byte
>> sequence it wastes 6 bytes at most, which doesn't sound good.
>
> I think you missed one of my points (#2), which is that aligning single
> instructions is not enough to remedy the problem.
> For example, suppose you have an add; jne sequence, and say we decide on
> a magical alignment that we apply globally to jcc instructions.
> Assume that the jne was indeed aligned correctly and we insert no nops.
> Then it is not necessarily the case that the fused add + jne sequence
> has the desired alignment property as well (i.e. the add might cross a
> 32 byte boundary, tainting the macro-fused micro code). Therefore, this
> will not work.
> I suppose that if you always put in at least one nop before the jcc, to
> break all macro fusion of branches globally, then that would work. But
> that seems like a larger hammer than what we need.
I was thinking about aligning macro-fused instruction sequences and not
individual jcc instructions. There are both automatic and cooperative
ways to detect them.
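
To make the boundary condition concrete, here is a small self-contained
sketch (toy code of mine, not HotSpot and not the webrev): given the start
offset and size of a jcc, or of a fused pair measured as one byte range, it
checks whether the range crosses or ends at a 32 byte boundary and how much
nop padding would push it clear. The last assertions illustrate your
add; jne point: the jne alone can look fine while the fused pair still
straddles a boundary.

#include <cstdint>
#include <cassert>

// Toy sketch, not HotSpot code. A branch -- or a macro-fused
// condition + branch pair measured as one range -- is problematic
// when it crosses or ends at a 32 byte boundary.
static const uint32_t boundary = 32;

// True if the byte range [start, start + size) crosses a 32 byte
// boundary or has its last byte directly in front of one (ends at it).
static bool intersects_boundary(uint32_t start, uint32_t size) {
  uint32_t end = start + size;                          // one past last byte
  return (start / boundary) != ((end - 1) / boundary)   // crosses
      || (end % boundary) == 0;                         // ends at
}

// Nop bytes needed in front of the range so that it no longer crosses
// or ends at a boundary; for any range smaller than 32 bytes, padding
// up to the next 32 byte boundary clears the condition.
static uint32_t required_padding(uint32_t start, uint32_t size) {
  return intersects_boundary(start, size) ? boundary - (start % boundary) : 0;
}

int main() {
  // A 6 byte jcc starting at offset 29 crosses the boundary at 32,
  // and 3 bytes of padding move it to start exactly at 32:
  assert(intersects_boundary(29, 6));
  assert(required_padding(29, 6) == 3);

  // Macro fusion: a 3 byte add at offset 30 followed by a 2 byte jne
  // at offset 33. The jne alone looks fine ...
  assert(!intersects_boundary(33, 2));
  // ... but the fused add + jne range (offset 30, 5 bytes) straddles
  // the boundary, which is why aligning single jcc nodes is not enough.
  assert(intersects_boundary(30, 5));
  return 0;
}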
>>> 2) Another issue is that the alignment constraints do not apply just to
>>> a single Mach node; sometimes they apply to a fused op + jcc. We
>>> currently match the conditions and their branches separately (with the
>>> conditions not necessarily knowing they are indeed conditions to a
>>> branch, like for example an and instruction). So aligning the jcc, for
>>> example, is not necessarily going to help, unless its alignment knows
>>> what its preceding instruction is, and whether it will be fused or
>>> not. And depending on that, we want different alignment properties.
>>> So here the hammer is seemingly too loose.
>>
>> I mentioned MacroAssembler in my previous email because I don't consider
>> this a C2-specific problem. Stubs, the interpreter, and C1 are also
>> affected and we need to fix them too (considering that being on the edge
>> of a cache line may cause unpredictable behavior).
>
> I disagree that this is a correctness fix. The correctness fix is for
> branches on 64 byte boundaries, and is being dealt with
> using micro code updates (which disable micro op caching of the
> problematic branch and fused branch micro ops). What
> we are dealing with here is mitigating the performance hit of Intel's
> correctness mitigation for the erratum, which involves
> branches and fused branches crossing or ending at 32 byte boundaries. In
> other words, the correctness is dealt with elsewhere,
> and we are optimizing the code to avoid regressions for performance
> sensitive branches due to that correctness fix.
> Therefore, I think it is wise to focus the optimization efforts where it
> matters the most: C2.
Point taken. I agree that the focus should be on performance.
But I'd include stubs as well. Many of them are extensively used from
C2-generated code.
> I don't think that doing some analysis on the Mach nodes and injecting
> the padding only where we actually need it is too
> complicated in C2 (which I believe, at least for now, is where we should
> focus).
I'm curious: is there anything special about Mach IR that
helps/simplifies the analysis?
One problem I see with it is that some mach nodes issue local branches
for their own purposes (inside MachNode::emit()), and you can't mitigate
such cases at the level of Mach IR.
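
Purely as an illustration of where such a hook could live (toy types and
names, not the real MacroAssembler/CodeBuffer interface and not your
webrev), a padding check at the assembler level would also cover branches
emitted from inside MachNode::emit(), at the cost of the caller having to
pass in the fused size, since the assembler cannot see fusion on its own:

#include <cstdint>
#include <vector>

// Toy stand-in for an assembler code buffer; the real interfaces differ.
// It only shows at which level an emit-time mitigation would sit.
class ToyAssembler {
  std::vector<uint8_t> _code;
 public:
  uint32_t offset() const   { return (uint32_t)_code.size(); }
  void emit_byte(uint8_t b) { _code.push_back(b); }
  void nop()                { emit_byte(0x90); }

  // Pad so that the next inst_size bytes (a jcc, or a fused cmp/jcc pair
  // measured together) neither cross nor end at a 32 byte boundary --
  // same arithmetic as in the earlier sketch.
  void align_for_jcc_erratum(uint32_t inst_size) {
    const uint32_t boundary = 32;
    uint32_t start = offset();
    uint32_t end   = start + inst_size;
    bool bad = (start / boundary) != ((end - 1) / boundary) ||
               (end % boundary) == 0;
    for (uint32_t pad = bad ? boundary - (start % boundary) : 0; pad > 0; pad--) {
      nop();
    }
  }

  // A branch emitted from a node's own emit routine would call the same
  // hook right before writing its bytes (sizes here are made up).
  void jcc_with_mitigation() {
    align_for_jcc_erratum(6);                     // assume a 6 byte jcc
    for (int i = 0; i < 6; i++) emit_byte(0x0F);  // placeholder encoding
  }
};

int main() {
  ToyAssembler a;
  for (int i = 0; i < 30; i++) a.nop();  // land near a 32 byte boundary
  a.jcc_with_mitigation();               // pads 2 bytes, jcc starts at 32
  return a.offset() == 38 ? 0 : 1;
}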
> I have made a prototype of what this might look like:
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
Just one more comment: it's weird to see intel_jcc_erratum referenced in
shared code. You could #ifdef it to be x86-only, but it's much better to
move the code to an x86-specific location.
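
For illustration only (hypothetical file and function names, nothing from
your patch), the kind of split I mean: shared code calls a neutral hook,
only the x86 translation unit gives it a real body, and every other
platform compiles it to a no-op, so neither the erratum name nor an #ifdef
shows up at the shared call sites:

#include <cstdint>

// Hypothetical shared header (say, branchPadding.hpp): the shared pass
// only ever sees this neutral declaration.
uint32_t branch_padding_needed(uint32_t inst_offset, uint32_t inst_size);

#if defined(__x86_64__) || defined(_M_X64)
// Hypothetical cpu/x86 file (say, intelJccErratum_x86.cpp): holds the
// 32 byte boundary arithmetic from the earlier sketches.
uint32_t branch_padding_needed(uint32_t inst_offset, uint32_t inst_size) {
  const uint32_t boundary = 32;
  uint32_t end = inst_offset + inst_size;
  bool bad = (inst_offset / boundary) != ((end - 1) / boundary) ||
             (end % boundary) == 0;
  return bad ? boundary - (inst_offset % boundary) : 0;
}
#else
// All other platforms: an empty definition keeps shared callers clean.
uint32_t branch_padding_needed(uint32_t, uint32_t) { return 0; }
#endif

int main() {
  // Shared code can call the hook unconditionally.
  return (int)branch_padding_needed(30, 6);  // 2 on x86, 0 elsewhere
}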
Best regards,
Vladimir Ivanov
> Incremental:
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.00_01/
>
> The idea is pretty much what I said to Paul. There are 3 hooks needed:
> 1) Apply pessimistic size measurements during branch shortening on
> affected nodes
> 2) Analyze which nodes will fuse, and tag all affected mach nodes with a
> flag
> 3) When emitting the code, add required padding on the flagged nodes
> that end at or cross 32 byte boundaries.
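
That structure sounds reasonable. To make sure I read hooks (2) and (3)
correctly, here is roughly how I picture them, with toy types standing in
for Mach IR (hypothetical names, nothing from your webrev):

#include <cstdint>
#include <vector>

// Toy model of hooks (2) and (3): hypothetical names, not Mach IR.
struct ToyNode {
  uint32_t size;        // pessimistic encoded size, as in hook (1)
  bool is_branch;       // a jcc-like node
  bool is_condition;    // a cmp/test/add-like node that can fuse
  bool jcc_flagged;     // set by the analysis pass, hook (2)
};

// Hook (2): walk the nodes of a block in order and flag branches plus
// the directly preceding condition they would fuse with.
static void flag_fusable(std::vector<ToyNode>& block) {
  for (size_t i = 0; i < block.size(); i++) {
    if (!block[i].is_branch) continue;
    block[i].jcc_flagged = true;
    if (i > 0 && block[i - 1].is_condition) {
      block[i - 1].jcc_flagged = true;   // tag the fused partner too
    }
  }
}

// Hook (3): at emit time, measure a run of consecutively flagged nodes
// as one range and pad (same boundary test as in the first sketch) when
// it crosses or ends at a 32 byte boundary.
static uint32_t emit_with_padding(const std::vector<ToyNode>& block) {
  const uint32_t boundary = 32;
  uint32_t offset = 0;
  size_t i = 0;
  while (i < block.size()) {
    if (!block[i].jcc_flagged) {
      offset += block[i++].size;         // "emit" an unaffected node
      continue;
    }
    size_t j = i;
    uint32_t range = 0;
    while (j < block.size() && block[j].jcc_flagged) {
      range += block[j++].size;          // whole flagged run as one range
    }
    uint32_t end = offset + range;
    if ((offset / boundary) != ((end - 1) / boundary) || end % boundary == 0) {
      offset += boundary - (offset % boundary);   // insert nop bytes
    }
    offset += range;                     // "emit" the flagged run
    i = j;
  }
  return offset;
}

int main() {
  // 27 bytes of other code, then a 3 byte add fusing with a 2 byte jne.
  std::vector<ToyNode> block = {
    {27, false, false, false},
    {3,  false, true,  false},   // add
    {2,  true,  false, false},   // jne
  };
  flag_fusable(block);
  // The fused pair would end exactly at offset 32, so 5 nop bytes are
  // inserted and the final offset is 27 + 5 + 3 + 2 = 37.
  return emit_with_padding(block) == 37 ? 0 : 1;
}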
>
> I haven't run exhaustive tests or measurements yet on this. I thought we
> should sync ideas so we agree about direction before I do too much.
>
> What do you think?
>
> Thanks,
> /Erik
>
>> Best regards,
>> Vladimir Ivanov
>>
>>> I'm not 100% sure what to suggest for the generic case, but perhaps:
>>>
>>> After things have stopped moving around, add a pass over the Mach
>>> nodes, similar to branch shortening, that:
>>>
>>> 1) Set up a new flag (Flags_intel_jcc_mitigation or something) to be
>>> used on Mach nodes to mark affected nodes.
>>> 2) Walk the Mach nodes and tag branches and conditions used by fused
>>> branches (by walking edges), checking that the two are adjacent (by
>>> looking at the node index in the block), and possibly also checking
>>> that it is one of the affected condition instructions that will get
>>> fused.
>>> 3) Now that we know which Mach nodes (and sequences of macro fused
>>> nodes) are problematic, we can put some code where the mach nodes are
>>> emitted that checks for consecutively tagged nodes and injects nops in
>>> the code buffer if they cross or end at 32 byte boundaries.
>>>
>>> I suppose an alternative strategy is making sure that any problematic
>>> instruction sequence that would be fused is also fused into one Mach
>>> node, by sprinkling more rules in the AD file for the various forms of
>>> conditional branches that we think cover all the cases, and then
>>> applying the alignment constraint on individual nodes only. But it
>>> feels like that could be more intrusive and less efficient.
>>>
>>> Since the generic problem is more involved compared to the simpler
>>> ZGC load barrier fix (which will need special treatment anyway), I
>>> would like to focus this RFE only on the ZGC load barrier branch,
>>> because it makes me sad when it has to suffer. Having said that, we
>>> will certainly look into fixing the generic problem too after this.
>>
>