8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier
erik.osterlund at oracle.com
Wed Feb 19 16:20:20 UTC 2020
Hi Vladimir,
Thank you for the review.
On 2/19/20 4:00 PM, Vladimir Ivanov wrote:
> Hi Erik,
>
>> I have talked to Vladimir, and I think at this point we agree about
>> doing this.
>
> Yes, we agreed that it's reasonable to take Erik's implementation as
> the stop-gap fix and come up with a more comprehensive fix later.
> Intel folks have been experimenting with a MacroAssembler-based
> solution, and the results are promising. But it needs more time.
>
>> As a reminder, here is my webrev that I proposed (+ rebase that I
>> just made):
>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>
> Can the x86-specific changes in output.cpp & node.hpp be moved to an
> arch-specific location? x86.ad maybe?
The node.hpp changes just allocate a flag bit for the mitigation; I
can't really move that code.
In output.cpp, I only have the hooks into the platform-specific code.
There are three hooks, and I need all of them. The code that actually
does something lives in platform-specific code.
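To give a feel for the split, the structure is roughly this (a sketch
with illustrative names only; the webrev is authoritative):

// Simplified sketch of the split between shared and x86-specific code.
class Block;     // C2 basic block
class MachNode;  // matched machine node

// node.hpp (shared): all that changes is that one more flag bit exists.
enum NodeFlagsSketch {
  Flag_avoid_back_to_back_before = 1 << 0,  // existing kind of flag
  Flag_intel_jcc_erratum         = 1 << 1   // new: node takes part in the mitigation
};

// output.cpp (shared): three hooks, implemented in x86-specific code
// and compiling to no-ops on other platforms (hook names are made up).
void jcc_erratum_tag_affected_nodes(Block* block);
int  jcc_erratum_compute_padding(MachNode* node, int current_offset);
void jcc_erratum_verify_padding(MachNode* node, int applied_padding);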
> Have you considered extending MachNode::compute_padding() to do the job?
I have. The MachNode::compute_padding() function is virtual and allows
individual nodes to request padding.
The padding I apply is not a per-node property: it concerns consecutive
nodes, due to macro fusion. So it did not seem like a good fit, and
because of the virtual dispatch, it would be messy to get right.
I also intentionally want the meaning of that per-node information to
remain JCC-erratum invariant. That lets me use it to assert that the
node itself does not expect padding other than what I am enforcing due
to the JCC erratum. This makes it easy to catch bugs where the applied
JCC-erratum padding goes against the expectations of the node, ensuring
that expectations on both ends are honoured.
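For context, the rule being enforced is the one from Intel's mitigation
white paper: a branch, or a macro-fused cmp+jcc pair, must not cross or
end on a 32-byte boundary. As a standalone sketch of just that rule (my
own simplified names, not the webrev code):

#include <cassert>
#include <cstdint>

// Sketch of the 32-byte boundary rule behind the mitigation.
static bool crosses_or_ends_on_32_byte_boundary(uintptr_t start, int size) {
  uintptr_t end = start + size;              // exclusive end of instruction
  return (start / 32) != ((end - 1) / 32)    // crosses a 32-byte boundary
      || (end % 32) == 0;                    // or ends exactly on one
}

// Padding that moves an affected (fused) instruction past the boundary.
static int jcc_erratum_padding(uintptr_t start, int size) {
  return crosses_or_ends_on_32_byte_boundary(start, size)
             ? (int)(32 - (start % 32))      // nops up to the next boundary
             : 0;
}

int main() {
  assert(jcc_erratum_padding(30, 6) > 0);    // bytes 30..35 cross 32: pad
  assert(jcc_erratum_padding(26, 6) > 0);    // ends exactly at 32: pad
  assert(jcc_erratum_padding(8, 6) == 0);    // fits inside one 32-byte window
  return 0;
}

With the rule isolated like that, asserting that a node's own padding
expectations agree with what the mitigation enforces becomes
straightforward.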
There is already other code for applying padding that is not
node-specific, such as the avoid_back_to_back() logic, and this
optimization seemed closer in spirit to that, as it uses the index in
the block. That is why I solved it in a similar way.
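As a toy model, that kind of block-index-based logic looks like this
(heavily simplified; not the actual output.cpp code):

#include <cstddef>
#include <cstdio>
#include <vector>

// Toy model of padding decided from a node's position in the block, in
// the spirit of the avoid_back_to_back() handling.
struct ToyNode {
  int  size;                        // bytes the node emits
  bool avoid_back_to_back_before;   // wants separation from its predecessor
};

int main() {
  std::vector<ToyNode> block = { {5, false}, {2, true}, {3, false} };
  int offset = 0;
  for (std::size_t i = 0; i < block.size(); i++) {
    int padding = 0;
    // The decision depends on the node's neighbour in the block, not on
    // the node alone.
    if (i > 0 && block[i].avoid_back_to_back_before) {
      padding = 1;                  // emit a one-byte nop in between
    }
    offset += padding + block[i].size;
    std::printf("node %zu: padding %d, next offset %d\n", i, padding, offset);
  }
  return 0;
}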
Thanks,
/Erik
> Best regards,
> Vladimir Ivanov
>
>> On 11/25/19 4:31 PM, Vladimir Ivanov wrote:
>>> Hi Erik,
>>>
>>>>> But I'd include stubs as well. Many of them are extensively used
>>>>> from C2-generated code.
>>>>
>>>> Okay. Any specific stubs you have in mind? If there are some
>>>> critical ones, we can sprinkle some scope objects like I did in the
>>>> ZGC code.
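>>>> Those scope objects are RAII-style: they pad up front so that the
>>>> branch emitted inside the scope cannot cross or end on a 32-byte
>>>> boundary, and verify on exit. Roughly (a simplified sketch, not the
>>>> exact ZGC code):
>>>>
>>>> class MacroAssembler;  // used opaquely here
>>>>
>>>> class JccErratumAlignmentScope {
>>>>  private:
>>>>   MacroAssembler* _masm;
>>>>   int             _size;  // size of the branch emitted inside the scope
>>>>  public:
>>>>   // Constructor: if the next _size bytes would cross or end on a
>>>>   // 32-byte boundary, emit nops first.
>>>>   JccErratumAlignmentScope(MacroAssembler* masm, int size);
>>>>   // Destructor: assert that exactly _size bytes were emitted, so the
>>>>   // alignment promise actually held.
>>>>   ~JccErratumAlignmentScope();
>>>> };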
>>>
>>> There are intrinsics for compressed strings [1], numerous copy stubs
>>> [2], trigonometric functions [3].
>>>
>>> It would be unfortunate if we have to go over all that code and
>>> manually instrument all the places where problematic instructions
>>> are issued. Moreover, the process has to be repeated for new code
>>> being added over time.
>>>
>>>> I do have concerns though about injecting magic into the
>>>> MacroAssembler that tries to solve this automagically on the
>>>> assembly level, by having the assembler spit out different
>>>> instructions than you requested.
>>>> The following comment from assembler.hpp captures my thought exactly:
>>>>
>>>> 207: // The Abstract Assembler: Pure assembler doing NO
>>>> optimizations on the
>>>> 208: // instruction level; i.e., what you write is what you get.
>>>> 209: // The Assembler is generating code into a CodeBuffer.
>>>
>>> While I see that the Assembler follows that (one instruction per
>>> method), the MacroAssembler does not: there are cases where the
>>> generated code differs depending on runtime flags (e.g., verification
>>> code) or input values (e.g., whether an AddressLiteral is reachable
>>> or not).
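>>> For instance, many methods pick a different instruction sequence
>>> depending on whether a literal is RIP-reachable, along these lines
>>> (simplified from macroAssembler_x86.cpp):
>>>
>>> void MacroAssembler::cmpptr(Register src1, AddressLiteral src2) {
>>>   if (reachable(src2)) {
>>>     cmpptr(src1, as_Address(src2));
>>>   } else {
>>>     lea(rscratch1, src2);
>>>     cmpptr(src1, Address(rscratch1, 0));
>>>   }
>>> }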
>>>
>>>> I think it is desirable to keep the property that when we tell the
>>>> *Assembler to generate __ cmp(); __ jcc(); it will do exactly that.
>>>> When such assumptions break, any code that has calculated the size
>>>> of instructions, making assumptions about their size, will fail.
>>>> For example, any MachNode with a hardcoded size() might
>>>> underestimate how much memory is really needed, and code such as
>>>> nmethod entry barriers that have calculated the offset to the cmp
>>>> immediate might suddenly stop working. There is similar code for
>>>> oop maps, where we calculate offsets into mach nodes with oop maps
>>>> to describe the PC after a call, which would also stop working:
>>>>
>>>> // !!!!! Special hack to get all types of calls to specify the byte
>>>> //       offset from the start of the call to the point where the
>>>> //       return address will point.
>>>> int MachCallStaticJavaNode::ret_addr_offset()
>>>> {
>>>>   int offset = 5; // 5 bytes from start of call to where return
>>>>                   // address points
>>>>   offset += clear_avx_size();
>>>>   return offset;
>>>> }
>>>>
>>>> Basically, I think you might be able to mitigate more branches at
>>>> the MacroAssembler layer, but I think it would also be riskier, as
>>>> code that was not built for having variable size will start failing
>>>> in places we didn't think of. I can think of a few, and I suspect
>>>> there are other places I have not thought about.
>>>>
>>>> So from that point of view, I would rather do this on Mach nodes,
>>>> where it is safe. I think we can catch the most important ones
>>>> there, and miss a few branches that the macro assembler would have
>>>> found with magic, rather than apply it to all branches and hope we
>>>> find all the bugs caused by unexpected magic.
>>>>
>>>> Do you agree? Or perhaps I misunderstood what you are suggesting.
>>>
>>> You raise a valid point: there are places in the VM which rely on
>>> hard-coded instruction sequences. If such an instruction changes,
>>> all relevant places have to be adjusted. And the JVM is already very
>>> cautious about such cases.
>>>
>>> I agree with you that the MacroAssembler-based approach is more
>>> risky, but IMO the risk is modest (few places are affected) and
>>> manageable (a dedicated stress mode should greatly improve test
>>> effectiveness).
>>>
>>> My opinion is that if we are satisfied with the coverage the C2 CFG
>>> instrumentation provides and don't expect any more work on
>>> mitigations, then there is no motivation to invest in a
>>> MacroAssembler-based approach.
>>>
>>> Otherwise, there are basically 2 options:
>>>
>>> * "opt-in": explicitly mark all the places where mitigations are
>>> applied, by default nothing is mitigated
>>>
>>> * "opt-out": mitigate everything unless mitigations are explicitly
>>> disabled
>>>
>>> Both approaches provide fine-grained control over what's being
>>> mitigated, but with "opt-out" there's more code to care about: it's
>>> easy to miss important cases and too tempting to enable more than we
>>> are 100% certain about.
>>>
>>> Both can be applied to individual CFG nodes and make CFG
>>> instrumentation redundant.
>>>
>>> But if there's a need to instrument large portions of
>>> (macro)assembly code, then IMO opt-in adds too much in terms of work
>>> required, noise (on the code level), maintenance, and burden for
>>> future code changes. So, I don't consider it a feasible option in
>>> such a situation.
>>>
>>> It looks like a mixture of opt-in (explicitly enabled in some
>>> context: in C2 during code emission, during particular stub
>>> generation, etc.) and opt-out (on the level of individual
>>> instructions) gives the best of both approaches.
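>>> In code, I imagine the mixture looking roughly like this
>>> (hypothetical names, just to illustrate the idea):
>>>
>>> {
>>>   // Opt-in at the context level: nothing is mitigated unless a
>>>   // scope enables it (e.g. around C2 code emission).
>>>   JccMitigationScope mitigate(masm);
>>>   __ cmpptr(obj, mask);                 // mitigated automatically
>>>   __ jcc(Assembler::notZero, slow_path);
>>>
>>>   {
>>>     // Opt-out at the instruction level, where surrounding code
>>>     // relies on a fixed instruction size.
>>>     JccMitigationExclusion fixed_size(masm);
>>>     __ call(RuntimeAddress(stub));      // must stay exactly 5 bytes
>>>   }
>>> }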
>>>
>>> But, again, if C2 CFG instrumentation is good enough, then it'll be
>>> a wasted effort.
>>>
>>> So, I envision 3 possible scenarios:
>>>
>>> (1) just instrument Mach IR and be done with it;
>>>
>>> (2) (a) start with Mach IR;
>>>     (b) later it turns out that extensive portions of
>>>         (macro)assembly code have to be instrumented (or, for
>>>         example, C1/Interpreter);
>>>     (c) implement MacroAssembler mitigations
>>>
>>> (3) start with MacroAssembler mitigations and be done with it
>>>     * doesn't preclude gradual roll-out across different subsystems
>>>
>>> Mach IR instrumentation (#1/#2) is the safest variant, but it may
>>> require more work.
>>>
>>> #3 is broadly applicable, but also riskier.
>>>
>>> What I don't consider as a viable option is C2 CFG instrumentation
>>> accompanied by numerous per-instruction mitigations scattered across
>>> the code base.
>>>
>>>>>> I have made a prototype of what this might look like:
>>>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>>>
>>>>> Just one more comment: it's weird to see intel_jcc_erratum
>>>>> referenced in shared code. You could #ifdef it for x86 only, but
>>>>> it's much better to move the code to an x86-specific location.
>>>>
>>>> Sure, I can move that to an x86 file and make it build only on x86_64.
>>>
>>> Yes, sounds good. But let's agree on general direction first.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l1666
>>>
>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/623722a6aeb9/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
>>>
>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86_(sin|cos|...).cpp
>>>
>>