8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier

Vladimir Ivanov vladimir.x.ivanov at oracle.com
Fri Feb 21 17:56:56 UTC 2020


Hi Erik,

> The approach you describe seems to be (roughly):
> 1) Perform platform-specific analysis, injecting a new special nop mach 
> node
> 2) Perform platform-specific padding during code emission (yet hidden 
> behind shared-looking compute_padding hooks)

Yes, that's basically what I have in mind.
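
To make the emission-time part concrete, the padding such a special nop 
node has to produce boils down to the erratum rule as I understand it: 
pad whenever the marked branch (or the fused cmp/jcc pair) would 
otherwise cross a 32-byte boundary or end exactly on one. A minimal 
standalone sketch, with names of my own rather than anything from the 
patch:

  // Not HotSpot code, just the rule the special nop node has to implement.
  static const int chunk_size = 32;

  // True if 'size' bytes starting at 'offset' cross a 32-byte boundary or
  // end exactly on one (the two cases affected by the erratum).
  static bool needs_mitigation(int offset, int size) {
    return (offset / chunk_size) != ((offset + size - 1) / chunk_size) ||
           ((offset + size) % chunk_size) == 0;
  }

  // Nop bytes to emit in front of the marked branch (or fused pair) so that
  // it starts at the next 32-byte boundary; 0 if no mitigation is needed.
  static int nop_padding(int offset, int size) {
    return needs_mitigation(offset, size) ? chunk_size - (offset % chunk_size)
                                          : 0;
  }

  // E.g. nop_padding(30, 4) == 2: a 4-byte fused pair at offset 30 would
  // cross the boundary at 32, so 2 bytes of nops push it to offset 32.

The platform-specific analysis then only has to mark the branch (plus a 
preceding fusible cmp/test) and put the special node in front of it; at 
emission time the node asks for nop_padding() with the current offset.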

> But remember we have to take care of both initial blob sizing and branch 
> shortening, applying conservative size measurements, in addition
> to the real padding we apply in the end. Your platform-specific analysis 
> phase probably needs to do something before branch shortening,
> as branch shortening needs to know about the special alignment, at least 
> as a conservative estimate. But at this point, we do not yet know
> the real size of mach nodes. We only know that when they are emitted in 
> the code buffer. Only then do we apply the real padding, and the size
> of that padding depends on the adjacent nodes to the special mach 
> node. So we would seemingly have to do another pass through once sizes
> have calmed down after branch shortening, to perform another round of 
> analysis feeding adjacent mach node information into our special nops.

Yes, you raise very good points.

I missed the detail that padding relies on both 
MachNode::compute_padding() and MachNode::alignment_required(). And 
customizing both compute_padding() and alignment_required() doesn't look 
like a valid option anymore.
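
The more fundamental mismatch, as you pointed out, is that the property 
being padded for belongs to a pair of nodes (the jcc plus the cmp/test it 
macro-fuses with), while the per-node hook only sees the node itself. A 
tiny standalone illustration (the byte sizes are the usual x86 encodings, 
2-byte test reg,reg and 2-byte short jcc; none of it is HotSpot code):

  #include <cstdio>

  // True if 'size' bytes at 'offset' cross or end on a 32-byte boundary.
  static bool affected(int offset, int size) {
    return (offset / 32) != ((offset + size - 1) / 32) ||
           ((offset + size) % 32) == 0;
  }

  int main() {
    int cmp_offset = 30, cmp_size = 2;  // test reg,reg just before a boundary
    int jcc_offset = 32, jcc_size = 2;  // short jcc starting on the boundary
    // The jcc alone is fine, but the macro-fused pair crosses the boundary,
    // so the padding decision cannot be made from the jcc node alone.
    printf("jcc alone:  %d\n", affected(jcc_offset, jcc_size));             // 0
    printf("fused pair: %d\n", affected(cmp_offset, cmp_size + jcc_size));  // 1
    return 0;
  }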

Speaking of custom mach nodes to represent padding, indeed 
Compile::shorten_branches() does replace the nodes we specifically care 
about: jumps.

And that leads to 2 passes:

   (1) insert custom nodes before shorten_branches() (technically, enough 
information to estimate instruction sizes is available once RA is over);

   (2) after Compile::shorten_branches() (but before 
Compile::fill_buffer()) do one more pass and replace the size estimates 
with exact sizes (a toy model of the two passes follows below).
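
Something along these lines (nothing here is real Compile/PhaseOutput 
code; a near jcc is 6 bytes, a short one 2, and 31 bytes is the 
worst-case distance to the next 32-byte boundary):

  #include <cstdio>
  #include <vector>

  struct Ins {
    int  size;    // size estimate; becomes exact after "branch shortening"
    bool marked;  // a jump (really the fused cmp/jcc pair) to be mitigated
    bool is_pad;  // placeholder inserted by pass 1
  };

  // Nop bytes needed so that 'size' bytes at 'offset' neither cross nor end
  // on a 32-byte boundary.
  static int pad_for(int offset, int size) {
    bool bad = (offset / 32) != ((offset + size - 1) / 32) ||
               ((offset + size) % 32) == 0;
    return bad ? 32 - (offset % 32) : 0;
  }

  int main() {
    // A block after RA: two marked jumps, still estimated as long (6-byte) jccs.
    std::vector<Ins> block = { {12, false, false}, {18, false, false},
                               {6, true, false}, {9, false, false},
                               {6, true, false} };

    // Pass 1 (before shorten_branches): put a placeholder in front of every
    // marked jump, sized with the worst case (31 bytes) so that blob sizing
    // and branch shortening stay conservative.
    std::vector<Ins> out;
    for (const Ins& ins : block) {
      if (ins.marked) out.push_back({31, false, true});
      out.push_back(ins);
    }

    // ... shorten_branches() runs here; say both jumps turn out short ...
    for (Ins& ins : out) if (ins.marked) ins.size = 2;

    // Pass 2 (before fill_buffer): sizes are exact now, so replace every
    // placeholder's estimate with the real padding for the jump behind it.
    int offset = 0;
    for (size_t i = 0; i < out.size(); i++) {
      if (out[i].is_pad) out[i].size = pad_for(offset, out[i + 1].size);
      offset += out[i].size;
    }
    printf("block ends at offset %d\n", offset);  // 45: one pad of 2, one of 0
    return 0;
  }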

Regarding the 1st pass, the only variance in size I see is due to short 
vs long jumps. Since you require no additional padding for marked nodes, 
the rest should stay the same.

FTR, the 2nd pass is optional: it's possible to use conservative 
estimates (just assume all jumps are long) when deciding whether to pad 
or not (and how much), but considering the overheads of additional 
padding we observed, it's probably not the best decision from a 
performance perspective.
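
For instance, with the same boundary rule and the standard encodings (2 
bytes for a short jcc, 6 for a near one), a branch sitting 3 bytes before 
a 32-byte boundary only triggers padding under the conservative 
assumption:

  #include <cstdio>

  int main() {
    // A marked branch starting at offset 29, 3 bytes before a 32-byte boundary.
    const int sizes[] = { 2 /* short jcc: opcode + rel8 */,
                          6 /* near jcc: 0x0F 0x8x + rel32 */ };
    for (int size : sizes) {
      int end  = 29 + size;
      bool bad = (29 / 32) != ((end - 1) / 32) || (end % 32) == 0;
      printf("jcc of %d bytes -> %d nop bytes\n", size, bad ? 32 - (29 % 32) : 0);
    }
    return 0;  // prints 0 nop bytes for the short form, 3 for the long one
  }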

IMO 2 passes are enough, but in case they aren't for some reason, there 
are ways to fuse the work into code emission (e.g., by passing additional 
information into Node::emit()).

> So I'm not sure I see this having fewer platform-specific hooks in the 
> end, unless I have missed something. You would seemingly still have to do
> something specific when initializing the buffer size, something specific 
> to branch shortening analysis that knows about the largest JCC padding 
> we apply,
> and something specific (depending on adjacent nodes, that may have a 
> different size to the originally estimated size) during code emission.

> If you want to hide the platform hooks as much as possible, I think you 
> can refactor my solution to do that by exposing the current iteration state
> to global state e.g. Compile. That way, shared functions such as 
> MachNode::compute_padding and MachNode::alignment_required could move 
> into the platform
> layer and utilize the current iteration state to hide the platform 
> specific logic in seemingly shared functions, that call back to check 
> what is going
> on in the adjacent nodes.
> 
> However, I *really* don't like the practice of adding more and more 
> random stuff on Compile though. So let's say we make a preparatory patch
> to turn Output() into its own Phase, so we can extract all the random 
> stuff from Compile that doesn't seem to belong there like this:
> 
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.02..03/
> 
> Now it seems more okay and we could do what I described and expose the 
> iteration state of the new PhaseOutput so that it becomes known to
> the platform-specific code, without dumping more random stuff on Compile:
> 
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.03..04/
> 
> Now we have removed all traces except the initial analysis hook from the 
> shared code. But at least it's in a function used for GC hooks to perform
> its analysis as well, so it kind of fits in almost as if it had a design 
> or something.

> If you like this approach, then perhaps I could perform said refactoring 
> as a follow-up RFE maybe? I'm thinking this refactoring touches quite a bit
> of code and should be separate, not to confuse people reading the 
> history, or trying to backport this. And I think I like the refactoring 
> regardless of
> JCC erratum code.
> 
> What do you think?

Nice! I like how the code shapes up in both patches. And irrespective of 
where we go with the actual refactoring, PhaseOutput looks interesting 
in its own right.
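
The shape I read out of webrev.03..04 is roughly the following toy (a 
self-contained model only; none of these names are the actual classes in 
the webrev): the output phase owns the iteration state, and a seemingly 
node-local hook consults it to look at the adjacent nodes:

  #include <cstdio>
  #include <vector>

  struct ToyNode {
    int  size;
    bool fusible_cmp;  // a cmp/test that macro-fuses with the following jcc
    bool jcc;
  };

  class ToyPhaseOutput {
    const std::vector<ToyNode>& _block;
    size_t _index;
  public:
    explicit ToyPhaseOutput(const std::vector<ToyNode>& block)
      : _block(block), _index(0) {}

    const ToyNode* next() const {
      return _index + 1 < _block.size() ? &_block[_index + 1] : nullptr;
    }

    // "Shared-looking" per-node padding hook that is really driven by the
    // iteration state: pad in front of a cmp/test whose fused pair would
    // cross or end on a 32-byte boundary.
    int compute_padding(int offset) const {
      const ToyNode& n = _block[_index];
      const ToyNode* nx = next();
      if (!n.fusible_cmp || nx == nullptr || !nx->jcc) return 0;
      int end  = offset + n.size + nx->size;
      bool bad = (offset / 32) != ((end - 1) / 32) || (end % 32) == 0;
      return bad ? 32 - (offset % 32) : 0;
    }

    void run() {
      int offset = 0;
      for (_index = 0; _index < _block.size(); _index++) {
        int pad = compute_padding(offset);
        if (pad > 0) printf("pad %d bytes before node %zu\n", pad, _index);
        offset += pad + _block[_index].size;
      }
    }
  };

  int main() {
    std::vector<ToyNode> block = { {12, false, false}, {17, false, false},
                                   {2, true, false},   // test, fuses with...
                                   {2, false, true} }; // ...this short jcc
    ToyPhaseOutput out(block);
    out.run();  // pair would span offsets 29..32 -> "pad 3 bytes before node 2"
    return 0;
  }

That keeps the hooks looking shared while the JCC-specific logic hides 
behind the phase, which is how I understand your description.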

Anyway, I'm fine with doing that refactoring as a follow-up activity.

webrev.02 looks good.

And thanks a lot for taking care of the problem and putting so much 
effort into it!

Best regards,
Vladimir Ivanov

> 
> Thanks,
> /Erik
> 
>> Best regards,
>> Vladimir Ivanov
>>
>>>> Have you considered extending MachNode::compute_padding() to do the 
>>>> job?
>>>
>>> I have. The MachNode::compute_padding() function is virtual and allows 
>>> individual nodes to request padding.
>>> The padding I apply is not a per-node property. It concerns 
>>> consecutive nodes, due to macro fusion. So
>>> it did not seem like a good fit, and due to the virtual nature, it 
>>> would be messy to get it right.
>>
>>> I also intentionally want to retain the meaning of that per-node 
>>> information, to be JCC-erratum invariant.
>>> That allows me to actually use it to assert that the node itself does 
>>> not expect a padding other than the
>>> one I am enforcing due to the JCC erratum. This allows me to catch 
>>> bugs easily where the JCC erratum padding
>>> applied goes against the expectations of the node, enforcing that 
>>> expectations on both ends are honoured.
>>>
>>> There is already other code for applying padding that is not 
>>> node-specific, such as the avoid_back_to_back()
>>> logic, and this optimization seemed in spirit closer to that, as it 
>>> uses the index in the block. So
>>> that is why I solved it in a similar way.
>>>
>>> Thanks,
>>> /Erik
>>>
>>>> Best regards,
>>>> Vladimir Ivanov
>>>>
>>>>> On 11/25/19 4:31 PM, Vladimir Ivanov wrote:
>>>>>> Hi Erik,
>>>>>>
>>>>>>>> But I'd include stubs as well. Many of them are extensively used 
>>>>>>>> from C2-generated code.
>>>>>>>
>>>>>>> Okay. Any specific stubs you have in mind? If there are some 
>>>>>>> critical ones, we can sprinkle some scope objects like I did in 
>>>>>>> the ZGC code.
>>>>>>
>>>>>> There are intrinsics for compressed strings [1], numerous copy 
>>>>>> stubs [2], trigonometric functions [3].
>>>>>>
>>>>>> It would be unfortunate if we have to go over all that code and 
>>>>>> manually instrument all the places where problematic instructions 
>>>>>> are issued. Moreover, the process has to be repeated for new code 
>>>>>> being added over time.
>>>>>>
>>>>>>> I do have concerns though about injecting magic into the 
>>>>>>> MacroAssembler that tries to solve this automagically on the 
>>>>>>> assembly level, by having the assembler spit out different
>>>>>>> instructions than you requested.
>>>>>>> The following comment from assembler.hpp captures my thought 
>>>>>>> exactly:
>>>>>>>
>>>>>>> 207: // The Abstract Assembler: Pure assembler doing NO 
>>>>>>> optimizations on the
>>>>>>> 208: // instruction level; i.e., what you write is what you get.
>>>>>>> 209: // The Assembler is generating code into a CodeBuffer.
>>>>>>
>>>>>> While I see that Assembler follows that (instruction per method), 
>>>>>> MacroAssembler does not: there are cases when generated code 
>>>>>> differ depending on runtime flags (e.g., verification code) or 
>>>>>> input values (e.g., whether AddressLiteral is reachable or not).
>>>>>>
>>>>>>> I think it is desirable to keep the property that when we tell 
>>>>>>> the *Assembler to generate a __ cmp(); __ jcc(); it will do 
>>>>>>> exactly that.
>>>>>>> When such assumptions break, any code that has calculated the 
>>>>>>> size of instructions, making assumptions about their size, will 
>>>>>>> fail.
>>>>>>> For example, any MachNode with hardcoded size() might 
>>>>>>> underestimate how much memory is really needed, and code such as 
>>>>>>> nmethod entry barriers
>>>>>>> that have calculated the offset to the cmp immediate might 
>>>>>>> suddenly stop working. There is similar code for oop maps 
>>>>>>> where we
>>>>>>> calculate offsets into mach nodes with oop maps to describe the 
>>>>>>> PC after a call, which will stop working:
>>>>>>>
>>>>>>> // !!!!! Special hack to get all types of calls to specify the 
>>>>>>> byte offset
>>>>>>> //       from the start of the call to the point where the return 
>>>>>>> address
>>>>>>> //       will point.
>>>>>>> int MachCallStaticJavaNode::ret_addr_offset()
>>>>>>> {
>>>>>>>    int offset = 5; // 5 bytes from start of call to where return 
>>>>>>> address points
>>>>>>>    offset += clear_avx_size();
>>>>>>>    return offset;
>>>>>>> }
>>>>>>>
>>>>>>> Basically, I think you might be able to mitigate more branches on 
>>>>>>> the MacroAssembler layer, but I think it would also be more 
>>>>>>> risky, as code that was
>>>>>>> not built for having random size will start failing, in places we 
>>>>>>> didn't think of. I can think of a few, and feel like there are 
>>>>>>> probably other places I have not thought about.
>>>>>>>
>>>>>>> So from that point of view, I think I would rather do this on 
>>>>>>> Mach nodes where it is safe, and I think we can catch the most 
>>>>>>> important ones there,
>>>>>>> and miss a few branches that the macro assembler would have found 
>>>>>>> with magic, than apply it to all branches and hope we find all 
>>>>>>> the bugs due to unexpected magic.
>>>>>>>
>>>>>>> Do you agree? Or perhaps I misunderstood what you are suggesting.
>>>>>>
>>>>>> You raise a valid point: there are places in the VM which rely on 
>>>>>> hard-coded instruction sequences. If such an instruction sequence 
>>>>>> changes, all relevant places have to be adjusted. And the JVM is 
>>>>>> already very cautious about such cases.
>>>>>>
>>>>>> I agree with you that a MacroAssembler-based approach is more 
>>>>>> risky, but IMO the risk is modest (few places are affected) and 
>>>>>> manageable (a dedicated stress mode should greatly improve test 
>>>>>> effectiveness).
>>>>>>
>>>>>> My opinion is that if we are satisfied with the coverage C2 CFG 
>>>>>> instrumentation provides and don't expect any more work on 
>>>>>> mitigations, then there's no motivation to invest in a 
>>>>>> MacroAssembler-based approach.
>>>>>>
>>>>>> Otherwise, there are basically 2 options:
>>>>>>
>>>>>>   * "opt-in": explicitly mark all the places where mitigations are 
>>>>>> applied, by default nothing is mitigated
>>>>>>
>>>>>>   * "opt-out": mitigate everything unless mitigations are 
>>>>>> explicitly disabled
>>>>>>
>>>>>> Both approaches provide fine-grained control over what's being 
>>>>>> mitigated, but with "opt-out" there's more code to care about: 
>>>>>> it's easy to miss important cases and too tempting to enable more 
>>>>>> than we are 100% certain about.
>>>>>>
>>>>>> Both can be applied to individual CFG nodes and make CFG 
>>>>>> instrumentation redundant.
>>>>>>
>>>>>> But if there's a need to instrument large portions of 
>>>>>> (macro)assembly code, then IMO opt-in adds too much in terms of 
>>>>>> work required, noise (on code level), maintenance, and burden for 
>>>>>> future code changes. So, I don't consider it as a feasible option 
>>>>>> in such a situation.
>>>>>>
>>>>>> It looks like a mixture of opt-in (explicitly enable in some 
>>>>>> context: in C2 during code emission, particular stub generation, 
>>>>>> etc) and opt-out (on the level of individual instructions) gives 
>>>>>> the best of both approaches.
>>>>>>
>>>>>> But, again, if C2 CFG instrumentation is good enough, then it'll 
>>>>>> be a wasted effort.
>>>>>>
>>>>>> So, I envision 3 possible scenarios:
>>>>>>
>>>>>>   (1) just instrument Mach IR and be done with it;
>>>>>>
>>>>>>   (2) (a) start with Mach IR;
>>>>>>       (b) later it turns out that extensive portions of 
>>>>>> (macro)assembly code have to be instrumented (or, for example, 
>>>>>> C1/Interpreter)
>>>>>>       (c) implement MacroAssembler mitigations
>>>>>>
>>>>>>   (3) start with MacroAssembler mitigations and be done with it
>>>>>>      * doesn't preclude a gradual roll-out across different subsystems
>>>>>>
>>>>>> Mach IR instrumentation (#1/#2) is the safest variant, but it may 
>>>>>> require more work.
>>>>>>
>>>>>> #3 is broadly applicable, but also riskier.
>>>>>>
>>>>>> What I don't consider as a viable option is C2 CFG instrumentation 
>>>>>> accompanied by numerous per-instruction mitigations scattered 
>>>>>> across the code base.
>>>>>>
>>>>>>>>> I have made a prototype of what this might look like, and it looks 
>>>>>>>>> like this:
>>>>>>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>>>>>>
>>>>>>>> Just one more comment: it's weird to see intel_jcc_erratum 
>>>>>>>> referenced in shared code. You could #ifdef it for x86-only, but 
>>>>>>>> it's much better to move the code to an x86-specific location.
>>>>>>>
>>>>>>> Sure, I can move that to an x86 file and make it build only on 
>>>>>>> x86_64.
>>>>>>
>>>>>> Yes, sounds good. But let's agree on general direction first.
>>>>>>
>>>>>> Best regards,
>>>>>> Vladimir Ivanov
>>>>>>
>>>>>> [1] 
>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l1666 
>>>>>>
>>>>>>
>>>>>> [2] 
>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/623722a6aeb9/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp 
>>>>>>
>>>>>>
>>>>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86_(sin|cos|...).cpp
>>>>>>
>>>>>
>>>
> 

