8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier

erik.osterlund at oracle.com erik.osterlund at oracle.com
Fri Feb 21 15:28:20 UTC 2020


Hi Vladimir,

On 2/20/20 10:34 AM, Vladimir Ivanov wrote:
> Hi Erik,
>
>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>
>>> Can x86-specific changes in output.cpp & node.hpp be moved to a 
>>> arch-specific location? x86.ad maybe?
>>
>> The node.hpp changes are just allowing a flag bit for the mitigation. 
>> I can't really move that code.
>> In output.cpp, I only have the hooks to the platform specific code in 
>> there. There are 3 hooks,
>> and I need them. The code that actually does something is in platform 
>> specific code though.
>
> You sprinkle platform-specific portions in shared code. What if you 
> remove the new #ifdefs? As they are now, the output.cpp/node.hpp 
> changes don't make any sense on non-x86 platforms, and the #ifdefs add 
> unnecessary noise.
>
> We try to clean up existing occurrences of platform-specific code 
> (unfortunately, we have a long backlog) and avoid introducing new 
> cases unless it's inevitable. It favors genericity of shared code over 
> convenience and forces platforms to pay for capabilities/features 
> nobody else is using.

I do see your concern, and agree it is not nice to have platform-specific 
code in shared code. There needs to be some kind of call from shared code 
to platform-specific code, though, as this is inherently platform-specific. 
That call can take different forms and can try to reuse existing shared hooks.

> I'll try to describe an alternative approach which doesn't require 
> intrusive changes in code emission phase.
>
> I'm looking at:
>
>  47 bool IntelJccErratum::is_jcc_erratum_branch(const Block* block, 
> const MachNode* node, uint node_index) {
>  48   if (node->is_MachCall() && !node->is_MachCallJava()) {
>  49     return true;
>  50   }
>  51   return node_index == block->number_of_nodes() - 1;
>  52 }
>
> ...
>
>  59 int IntelJccErratum::tag_affected_machnodes(Compile* C, PhaseCFG*
> ...
>  76         if (!m->is_MachReturn() && !m->is_MachCall()) {
>  77           // We might fuse a problematic jcc erratum branch with a 
> preceding
>  78           // ALU instruction - we must catch such problematic 
> macro fusions
>  79           // and flag the ALU instruction as problematic too.
>
> For Calls it's per-node property and macro-fusion is relevant only to 
> the last pair of nodes in the block.
>
> So, Calls can be covered by overriding relevant compute_padding() in 
> x86_64.ad.
>
> Regarding macro-fused pairs, I propose to cover them with a separate 
> pass: find all occurrences and insert a customized MachNopNode right 
> before each one. (It can be used to pad Call nodes as well to make the 
> implementation uniform.)
>
> The following code can be split between the pass itself (compute 
> instruction size):
>
>  97 int IntelJccErratum::compute_padding(uintptr_t current_offset, 
> const MachNode* mach, Block* block, uint index_in_block, 
> PhaseRegAlloc* regalloc) {
>  98   int jcc_size = mach->size(regalloc);
>  99   if (index_in_block < block->number_of_nodes() - 1) {
> 100     Node* next = block->get_node(index_in_block + 1);
> 101     if (next->is_Mach() && (next->as_Mach()->flags() & 
> Node::Flag_intel_jcc_erratum)) {
> 102       jcc_size += mach->size(regalloc);
> 103     }
> 104   }
> 105   if (jcc_size > largest_jcc_size()) {
> 106     // Let's not try fixing this for nodes that seem unreasonably 
> large
> 107     return false;
> 108   }
>
> and code emission (encapsulated into compute_offset() of the new node):
>
> 109   if (is_crossing_or_ending_at_32_byte_boundary(current_offset, 
> current_offset + jcc_size)) {
> 110     return int(align_up(current_offset, 32) - current_offset);
> 111   } else {
> 112     return 0;
> 113   }
> 114 }
>
> (Padding invariant can be asserted as part of the pass when computing 
> instruction sizes.)
>
> Also, it doesn't require mach nodes to be marked since important 
> information is reified in the IR as a node.
>
> Both the pass and the node can be put in x86-specific location and 
> surfaced into shared code on Matcher as a flag and a method 
> declaration with platform-dependent implementations (empty except 
> x86). (It's a standard way to perform platform-specific operations on 
> mach IR.) The pass can be run before Compile::Output().
>
> (PS: another idea is to repurpose peephole transformation and 
> introduce rules for macro-fused pairs, but it looks like it would 
> require many more changes.)

There are indeed different ways of hooking it up. I have 3 hooks:

1) Perform platform-specific analysis (combined with additions to the 
initial blob size)
2) Perform platform-specific padding during code emission, when the 
positions of things are fixed and sizes are accurate rather than estimated
3) Perform platform-specific conservative code size estimation for 
branch shortening (before we know actual sizes), allowing branch 
shortening to still take place

The approach you describe seems to be (roughly):
1) Perform platform-specific analysis, injecting a new special nop mach node
2) Perform platform-specific padding during code emission (yet hidden 
behind shared-looking compute_padding hooks)

But remember we have to take care of both initial blob sizing and branch 
shortening, applying conservative size measurements, in addition
to the real padding we apply in the end. Your platform-specific analysis 
phase probably needs to do something before branch shortening,
as branch shortening needs to know about the special alignment, at least 
as a conservative estimate. But at this point, we do not yet know
the real size of mach nodes. We only know that when they are emitted in 
the code buffer. Only then do we apply the real padding, and the size
of that padding depends on the adjacent nodes to the special mach 
node. So we would seemingly have to do another pass once sizes 
have calmed down after branch shortening, to perform another round of 
analysis feeding adjacent mach node information into our special nops.

So I'm not sure I see this having fewer platform-specific hooks in the 
end, unless I have missed something. You would seemingly still have to do 
something specific when initializing the buffer size, something specific 
to branch shortening analysis that knows about the largest JCC padding 
we apply, and something specific (depending on adjacent nodes, which may 
have a different size than originally estimated) during code emission.

If you want to hide the platform hooks as much as possible, I think you 
can refactor my solution to do that by exposing the current iteration 
state as global state, e.g. on Compile. That way, shared functions such as 
MachNode::compute_padding and MachNode::alignment_required could move 
into the platform layer and use the current iteration state to hide the 
platform-specific logic in seemingly shared functions that call back to 
check what is going on in the adjacent nodes.

However, I *really* don't like the practice of adding more and more 
random stuff on Compile. So let's say we make a preparatory patch 
to turn Output() into its own Phase, so we can extract all the random 
stuff from Compile that doesn't seem to belong there, like this:

http://cr.openjdk.java.net/~eosterlund/8234160/webrev.02..03/

Now it seems more okay, and we could do what I described and expose the 
iteration state of the new PhaseOutput so that it becomes known to 
the platform-specific code, without dumping more random stuff on Compile:

http://cr.openjdk.java.net/~eosterlund/8234160/webrev.03..04/

Now we have removed all traces except the initial analysis hook from the 
shared code. But at least it's in a function used by GC hooks to perform 
their analysis as well, so it kind of fits in, almost as if it had a 
design or something.

If you like this approach, then perhaps I could perform said refactoring 
as a follow-up RFE? I'm thinking this refactoring touches quite a bit 
of code and should be separate, so as not to confuse people reading the 
history or trying to backport this. And I think I like the refactoring 
regardless of the JCC erratum code.

What do you think?

Thanks,
/Erik

> Best regards,
> Vladimir Ivanov
>
>>> Have you considered extending MachNode::compute_padding() to do the 
>>> job?
>>
>> I have. The MachNode::compute_padding() function is virtual and allows 
>> individual nodes to request padding.
>> The padding I apply is not a per-node property. It concerns 
>> consecutive nodes, due to macro fusion. So
>> it did not seem like a good fit, and due to the virtual nature, it 
>> would be messy to get it right.
>
>> I also intentionally want to retain the meaning of that per-node 
>> information, to be JCC-erratum invariant.
>> That allows me to actually use it to assert that the node itself does 
>> not expect a padding other than the
>> one I am enforcing due to the JCC erratum. This allows me to catch 
>> bugs easily where the JCC erratum padding
>> applied goes against the expectations of the node, enforcing that 
>> expectations on both ends are honoured.
>>
>> There is already other code for applying padding that is not 
>> node-specific, such as the avoid_back_to_back()
>> logic, and this optimization seemed in spirit closer to that, as it 
>> uses the index in the block. So
>> that is why I solved it in a similar way.
>>
>> Thanks,
>> /Erik
>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>> On 11/25/19 4:31 PM, Vladimir Ivanov wrote:
>>>>> Hi Erik,
>>>>>
>>>>>>> But I'd include stubs as well. Many of them are extensively used 
>>>>>>> from C2-generated code.
>>>>>>
>>>>>> Okay. Any specific stubs you have in mind? If there are some 
>>>>>> critical ones, we can sprinkle some scope objects like I did in 
>>>>>> the ZGC code.
>>>>>
>>>>> There are intrinsics for compressed strings [1], numerous copy 
>>>>> stubs [2], trigonometric functions [3].
>>>>>
>>>>> It would be unfortunate if we have to go over all that code and 
>>>>> manually instrument all the places where problematic instructions 
>>>>> are issued. Moreover, the process has to be repeated for new code 
>>>>> being added over time.
>>>>>
>>>>>> I do have concerns though about injecting magic into the 
>>>>>> MacroAssembler that tries to solve this automagically on the 
>>>>>> assembly level, by having the assembler spit out different
>>>>>> instructions than you requested.
>>>>>> The following comment from assembler.hpp captures my thought 
>>>>>> exactly:
>>>>>>
>>>>>> 207: // The Abstract Assembler: Pure assembler doing NO 
>>>>>> optimizations on the
>>>>>> 208: // instruction level; i.e., what you write is what you get.
>>>>>> 209: // The Assembler is generating code into a CodeBuffer.
>>>>>
>>>>> While I see that Assembler follows that (instruction per method), 
>>>>> MacroAssembler does not: there are cases when generated code 
>>>>> differ depending on runtime flags (e.g., verification code) or 
>>>>> input values (e.g., whether AddressLiteral is reachable or not).
>>>>>
>>>>>> I think it is desirable to keep the property that when we tell 
>>>>>> the *Assembler to generate a __ cmp(); __ jcc(); it will do 
>>>>>> exactly that.
>>>>>> When such assumptions break, any code that has calculated the 
>>>>>> size of instructions, making assumptions about their size, will 
>>>>>> fail.
>>>>>> For example, any MachNode with hardcoded size() might 
>>>>>> underestimate how much memory is really needed, and code such as 
>>>>>> nmethod entry barriers
>>>>>> that have calculated the offset to the cmp immediate might 
>>>>>> suddenly stop working. There is similar code for oop maps 
>>>>>> where we
>>>>>> calculate offsets into mach nodes with oop maps to describe the 
>>>>>> PC after a call, which will stop working:
>>>>>>
>>>>>> // !!!!! Special hack to get all types of calls to specify the 
>>>>>> byte offset
>>>>>> //       from the start of the call to the point where the return 
>>>>>> address
>>>>>> //       will point.
>>>>>> int MachCallStaticJavaNode::ret_addr_offset()
>>>>>> {
>>>>>>    int offset = 5; // 5 bytes from start of call to where return 
>>>>>> address points
>>>>>>    offset += clear_avx_size();
>>>>>>    return offset;
>>>>>> }
>>>>>>
>>>>>> Basically, I think you might be able to mitigate more branches on 
>>>>>> the MacroAssembler layer, but I think it would also be more 
>>>>>> risky, as code that was
>>>>>> not built for having random size will start failing, in places we 
>>>>>> didn't think of. I can think of a few, and feel like there are 
>>>>>> probably other places I have not thought about.
>>>>>>
>>>>>> So from that point of view, I think I would rather do this on 
>>>>>> Mach nodes where it is safe, and I think we can catch the most 
>>>>>> important ones there,
>>>>>> and miss a few branches that the macro assembler would have found 
>>>>>> with magic, than apply it to all branches and hope we find all 
>>>>>> the bugs due to unexpected magic.
>>>>>>
>>>>>> Do you agree? Or perhaps I misunderstood what you are suggesting.
>>>>>
>>>>> You raise a valid point: there are places in the VM which rely on 
>>>>> hard-coded instruction sequences. If such instruction changes, all 
>>>>> relevant places have to be adjusted. And JVM is already very 
>>>>> cautious about such cases.
>>>>>
>>>>> I agree with you that the MacroAssembler-based approach is more risky, but IMO the 
>>>>> risk is modest (few places are affected) and manageable (dedicated 
>>>>> stress mode should greatly improve test effectiveness).
>>>>>
>>>>> My opinion is that if we are satisfied with the coverage C2 CFG 
>>>>> instrumentation provides and don't expect any more work on 
>>>>> mitigations, then there's no motivation in investing into 
>>>>> MacroAssembler-based approach.
>>>>>
>>>>> Otherwise, there are basically 2 options:
>>>>>
>>>>>   * "opt-in": explicitly mark all the places where mitigations are 
>>>>> applied, by default nothing is mitigated
>>>>>
>>>>>   * "opt-out": mitigate everything unless mitigations are 
>>>>> explicitly disabled
>>>>>
>>>>> Both approaches provide fine-grained control over what's being 
>>>>> mitigated, but with "opt-out" there's more code to care about: 
>>>>> it's easy to miss important cases and too tempting to enable more 
>>>>> than we are 100% certain about.
>>>>>
>>>>> Both can be applied to individual CFG nodes and make CFG 
>>>>> instrumentation redundant.
>>>>>
>>>>> But if there's a need to instrument large portions of 
>>>>> (macro)assembly code, then IMO opt-in adds too much in terms of 
>>>>> work required, noise (on code level), maintenance, and burden for 
>>>>> future code changes. So, I don't consider it a feasible option 
>>>>> in such a situation.
>>>>>
>>>>> It looks like a mixture of opt-in (explicitly enable in some 
>>>>> context: in C2 during code emission, particular stub generation, 
>>>>> etc) and opt-out (on the level of individual instructions) gives 
>>>>> the best of both approaches.
>>>>>
>>>>> But, again, if C2 CFG instrumentation is good enough, then it'll 
>>>>> be a wasted effort.
>>>>>
>>>>> So, I envision 3 possible scenarios:
>>>>>
>>>>>   (1) just instrument Mach IR and be done with it;
>>>>>
>>>>>   (2) (a) start with Mach IR;
>>>>>       (b) later it turns out that extensive portions of 
>>>>> (macro)assembly code have to be instrumented (or, for example, 
>>>>> C1/Interpreter)
>>>>>       (c) implement MacroAssembler mitigations
>>>>>
>>>>>   (3) start with MacroAssembler mitigations and be done with it
>>>>>      * doesn't preclude gradual roll-out across different subsystems
>>>>>
>>>>> Mach IR instrumentation (#1/#2) is the safest variant, but it may 
>>>>> require more work.
>>>>>
>>>>> #3 is broadly applicable, but also riskier.
>>>>>
>>>>> What I don't consider as a viable option is C2 CFG instrumentation 
>>>>> accompanied by numerous per-instruction mitigations scattered 
>>>>> across the code base.
>>>>>
>>>>>>>> I have made a prototype of what this might look like, and it 
>>>>>>>> looks like this:
>>>>>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>>>>>
>>>>>>> Just one more comment: it's weird to see intel_jcc_erratum 
>>>>>>> referenced in shared code. You could #ifdef it for x86-only, but 
>>>>>>> it's much better to move the code to x86-specific location.
>>>>>>
>>>>>> Sure, I can move that to an x86 file and make it build only on 
>>>>>> x86_64.
>>>>>
>>>>> Yes, sounds good. But let's agree on general direction first.
>>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>> [1] 
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l1666 
>>>>>
>>>>>
>>>>> [2] 
>>>>> http://hg.openjdk.java.net/jdk/jdk/file/623722a6aeb9/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp 
>>>>>
>>>>>
>>>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/
>>>>>     macroAssembler_x86_(sin|cos|...).cpp
>>>>>
>>>>
>>



More information about the hotspot-compiler-dev mailing list