8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier
erik.osterlund at oracle.com
Fri Nov 22 14:23:29 UTC 2019
Hi Vladimir,
On 11/22/19 2:22 PM, Vladimir Ivanov wrote:
> Hi Erik,
>>>> That is a good question. Unfortunately, there are a few problems
>>>> applying such a strategy:
>>>>
>>>> 1) We do not want to constrain the alignment such that the
>>>> instruction (+ specific offset) sits at e.g. the beginning of a 32
>>>> byte boundary. We want to be more loose and say that any alignment
>>>> is fine... except the bad ones (crossing and ending at a 32 byte
>>>> boundary). Otherwise I fear we will find ourselves bloating the
>>>> code cache with unnecessary nops to align instructions that would
>>>> never have been a problem. So in terms of alignment constraints, I
>>>> think such a hammer is too big.
>>>
>>> It would be interesting to have some data on that one. Aligning
>>> 5-byte instruction on 8-byte boundary wastes 3 bytes at most. For
>>> 10-byte sequence it wastes 6 bytes at most which doesn't sound good.
>>
>> I think you missed one of my points (#2), which is that aligning
>> single instructions is not enough to remedy the problem.
>> For example, consider you have an add; jne; sequence. Let's say we
>> decide on a magical alignment that we apply globally for
>> jcc instructions.
>> Assume that the jne was indeed aligned correctly and we insert no
>> nops. Then it is not necessarily the case that the fused
>> add + jne sequence has the desired alignment property as well (i.e.
>> the add might cross 32 byte boundaries, tainting the
>> macro fused micro code). Therefore, this will not work.
>> I suppose that if you would always put in at least one nop before the
>> jcc to break all macro fusions of branches globally,
>> then you will be able to do that. But that seems like a larger hammer
>> than what we need.
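[To make the scenario quoted above concrete, here is a minimal sketch; the helper name is made up for illustration and is not HotSpot's actual API:]

```cpp
#include <cstdint>

// Does the byte range [start, start + size) cross into a new 32-byte line,
// or end exactly on one? With an exclusive end offset, both bad cases reduce
// to start and end landing in different 32-byte lines.
static bool spans_32_byte_boundary(uint32_t start, uint32_t size) {
    return (start >> 5) != ((start + size) >> 5);
}

// Scenario: a 3-byte add at offset 30, macro-fused with a 2-byte jne at 33.
// The jne on its own sits entirely inside the line starting at offset 32,
// so a per-jcc alignment rule would leave it alone...
static bool jne_alone_is_bad()  { return spans_32_byte_boundary(33, 2); } // false
// ...but the fused add + jne pair still crosses the boundary at offset 32,
// tainting the macro-fused micro-ops.
static bool fused_pair_is_bad() { return spans_32_byte_boundary(30, 5); } // true
```

[This is why any per-instruction alignment rule must know whether the preceding instruction will fuse with the branch.]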
>
> I was thinking about aligning macro-fused instruction sequences and
> not individual jcc instructions. There are both automatic and
> cooperative ways to detect them.
Okay.
>
>>>> 2) Another issue is that the alignment constraints apply not just
>>>> to the one Mach node. It's sometimes for a fused op + jcc. Since we
>>>> currently match the conditions and their branches separately (and
>>>> the conditions not necessarily knowing they are indeed conditions
>>>> to a branch, like for example an and instruction). So aligning the
>>>> jcc for example is not necessarily going to help, unless its
>>>> alignment knows what its preceding instruction is, and whether it
>>>> will be fused or not. And depending on that, we want different
>>>> alignment properties. So here the hammer is seemingly too loose.
>>>
>>> I mentioned MacroAssembler in previous email, because I don't
>>> consider it as C2-specific problem. Stubs, interpreter, and C1 are
>>> also affected and we need to fix them too (considering being on the
>>> edge of cache line may cause unpredictable behavior).
>>
>> I disagree that this is a correctness fix. The correctness fix is for
>> branches on 64 byte boundaries, and is being dealt with
>> using micro code updates (that disables micro op caching of the
>> problematic branch and fused branch micro ops). What
>> we are dealing with here is mitigating the performance hit of Intel's
>> correctness mitigation for the erratum, which involves
>> branches and fused branches crossing or ending at 32 byte boundaries.
>> In other words, the correctness is dealt with elsewhere,
>> and we are optimizing the code to avoid regressions for performance
>> sensitive branches, due to that correctness fix.
>> Therefore, I think it is wise to focus the optimization efforts where
>> it matters the most: C2.
>
> Point taken. I agree that the focus should be on performance.
>
> But I'd include stubs as well. Many of them are extensively used from
> C2-generated code.
Okay. Any specific stubs you have in mind? If there are some critical
ones, we can sprinkle in some scope objects like I did in the ZGC code.
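[A rough sketch of what such a scope object could look like; all names here are made up for illustration and differ from the prototype's actual classes:]

```cpp
#include <cstdint>

// Stand-in for the real MacroAssembler: tracks a pc and a flag saying
// whether branches emitted right now should get erratum padding.
struct MacroAssemblerStub {
    uint32_t pc = 0;
    bool     mitigate = false;
    void emit_bytes(uint32_t n) { pc += n; }
};

// RAII scope: while alive, branch emission in the bracketed region is
// marked for mitigation; the previous state is restored on scope exit.
class JccErratumScope {
    MacroAssemblerStub* _masm;
    bool                _saved;
public:
    explicit JccErratumScope(MacroAssemblerStub* masm)
        : _masm(masm), _saved(masm->mitigate) {
        _masm->mitigate = true;
    }
    ~JccErratumScope() {
        _masm->mitigate = _saved;
    }
};
```

[The point of the RAII shape is that only explicitly bracketed stub regions pay for padding, rather than every branch the assembler ever emits.]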
>
>> I don't think that doing some analysis on the Mach nodes and
>> injecting the padding only where we actually need it is too
>> complicated in C2 (which I believe, at least for now, is where we
>> should focus).
>
> I'm curious: is there anything special about Mach IR which
> helps/simplifies the analysis?
Yeah, the fact that you can walk the graph and tag the problematic
combinations of mach nodes in one pass, and then apply the alignment with
that knowledge when code is emitted.
> One problem I see with it is that some mach nodes issue local branches
> for their own purposes (inside MachNode::emit()) and you can't
> mitigate such cases on the level of Mach IR.
Sure. Things like MacroAssembler::fast_lock, which is directly injected
into the emission of a Mach node, will have some internal branches that
my analysis will not find.
I do have concerns, though, about injecting magic into the MacroAssembler
that tries to solve this automagically at the assembly level, by having
the assembler spit out different instructions than you requested.
The following comment from assembler.hpp captures my thought exactly:
207: // The Abstract Assembler: Pure assembler doing NO optimizations on the
208: // instruction level; i.e., what you write is what you get.
209: // The Assembler is generating code into a CodeBuffer.
I think it is desirable to keep the property that when we tell the
*Assembler to generate a __ cmp(); __ jcc(); it will do exactly that.
When that assumption breaks, any code that has calculated the size of the
emitted instructions will fail.
For example, any MachNode with a hardcoded size() might underestimate how
much memory is really needed, and code such as the nmethod entry barriers,
which have calculated the offset to the cmp immediate, might suddenly stop
working. There is similar code for oop maps, where we calculate offsets
into mach nodes to describe the PC after a call; that would also stop
working:
// !!!!! Special hack to get all types of calls to specify the byte offset
// from the start of the call to the point where the return address
// will point.
int MachCallStaticJavaNode::ret_addr_offset()
{
  int offset = 5; // 5 bytes from start of call to where return address points
  offset += clear_avx_size();
  return offset;
}
Basically, I think you might be able to mitigate more branches at the
MacroAssembler layer, but I think it would also be more risky, as code
that was not built for instructions having variable size would start
failing in places we didn't think of. I can think of a few such places,
and feel like there are probably others I have not thought about.
So from that point of view, I would rather do this on Mach nodes, where
it is safe and where I think we can catch the most important ones, and
miss a few branches that the macro assembler would have found with magic,
than apply it to all branches and hope we find all the bugs due to
unexpected magic.
Do you agree? Or perhaps I misunderstood what you are suggesting.
>> I have made a prototype of what this might look like:
>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>
> Just one more comment: it's weird to see intel_jcc_erratum referenced
> in shared code. You could #ifdef it for x86-only, but it's much better
> to move the code to x86-specific location.
Sure, I can move that to an x86 file and make it build only on x86_64.
Thanks,
/Erik
More information about the hotspot-compiler-dev mailing list