8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier
Vladimir Ivanov
vladimir.x.ivanov at oracle.com
Fri Feb 21 18:00:58 UTC 2020
> IMO 2 passes are enough, but in case they're not for some reason, there are
> some ways to fuse it into code emission (e.g., passing additional
> information into Node::emit()).
To clarify: I'm talking here about 1 vs 2 passes specifically for custom
padding (not a 3rd one). If an additional pass turns out to be too
expensive, the work can be fused into the code emission pass.
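For illustration only (the names below are invented for the example, not
the actual HotSpot API), fusing the work into emission could look
roughly like this:

  // Hypothetical sketch: instead of running a separate padding pass,
  // the emission loop asks a precomputed analysis how much padding the
  // node needs and emits nops before emitting the node itself.
  void emit_with_erratum_padding(MachNode* node, int index_in_block,
                                 CodeBuffer& cb, PhaseRegAlloc* ra) {
    int pad = erratum_padding_for(node, index_in_block); // 0 if unaffected
    if (pad > 0) {
      MacroAssembler masm(&cb);
      masm.nop(pad); // keep the macro-fused cmp/jcc off a 32-byte boundary
    }
    node->emit(cb, ra); // normal per-node emission
  }
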
Best regards,
Vladimir Ivanov
>> So I'm not sure I see this having fewer platform-specific hooks in the
>> end, unless I have missed something. You would seemingly still have to
>> do something specific when initializing the buffer size, something
>> specific to the branch shortening analysis that knows about the largest
>> JCC padding we apply, and something specific (depending on adjacent
>> nodes, which may have a different size than originally estimated)
>> during code emission.
>
>> If you want to hide the platform hooks as much as possible, I think
>> you can refactor my solution to do that by exposing the current
>> iteration state as global state, e.g., on Compile. That way, shared
>> functions such as MachNode::compute_padding and
>> MachNode::alignment_required could move into the platform layer and
>> utilize the current iteration state to hide the platform-specific
>> logic in seemingly shared functions that call back to check what is
>> going on in the adjacent nodes.
>>
>> However, I *really* don't like the practice of adding more and more
>> random stuff to Compile. So let's say we make a preparatory patch
>> to turn Output() into its own Phase, so we can extract all the random
>> stuff from Compile that doesn't seem to belong there, like this:
>>
>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.02..03/
>>
>> Now it seems more okay, and we could do what I described and expose the
>> iteration state of the new PhaseOutput so that it becomes known to
>> the platform-specific code, without dumping more random stuff on Compile:
>>
>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.03..04/
>>
>> Now we have removed all traces except the initial analysis hook from
>> the shared code. But at least it's in a function that GC hooks use to
>> perform their analysis as well, so it kind of fits in, almost as if it
>> had a design or something.
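>>
>> Purely as a sketch of the idea (the member names here are invented,
>> not taken from the actual webrev): PhaseOutput would own the emission
>> cursor, and the platform layer could consult it from inside the
>> seemingly shared hooks:
>>
>>   // Hypothetical: the pass records where it currently is, so
>>   // platform-specific code can inspect adjacent nodes from inside
>>   // shared functions such as MachNode::compute_padding().
>>   class PhaseOutput {
>>     Block* _current_block;  // block currently being emitted
>>     int    _current_index;  // index of the node within that block
>>    public:
>>     Block* current_block() const { return _current_block; }
>>     int    current_index() const { return _current_index; }
>>   };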
>
>> If you like this approach, then perhaps I could perform said
>> refactoring as a follow-up RFE? I'm thinking this refactoring
>> touches quite a bit of code and should be separate, so as not to
>> confuse people reading the history or trying to backport this. And I
>> think I like the refactoring regardless of the JCC erratum code.
>>
>> What do you think?
>
> Nice! I like how the code shapes up in both patches. And irrespective of
> where we go with the actual refactoring, PhaseOutput looks interesting
> in its own right.
>
> Anyway, I'm fine with refactoring the patch as a follow-up activity.
>
> webrev.02 looks good.
>
> And thanks a lot for taking care of the problem and putting so much
> effort into it!
>
> Best regards,
> Vladimir Ivanov
>
>>
>> Thanks,
>> /Erik
>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>>>> Have you considered extending MachNode::compute_padding() to do the
>>>>> job?
>>>>
>>>> I have. The MachNode::compute_padding() function is virtual and
>>>> allows individual nodes to request padding. The padding I apply is
>>>> not a per-node property: it concerns consecutive nodes, due to macro
>>>> fusion. So it did not seem like a good fit, and due to the virtual
>>>> nature, it would be messy to get it right.
>>>
>>>> I also intentionally want to retain the meaning of that per-node
>>>> information, so that it stays JCC-erratum invariant. That allows me
>>>> to use it to assert that the node itself does not expect a padding
>>>> other than the one I am enforcing due to the JCC erratum. This lets
>>>> me easily catch bugs where the applied JCC erratum padding goes
>>>> against the expectations of the node, enforcing that expectations on
>>>> both ends are honoured.
>>>>
>>>> There is already other code for applying padding that is not
>>>> node-specific, such as the avoid_back_to_back() logic, and this
>>>> optimization seemed closer in spirit to that, as it uses the index
>>>> in the block. So that is why I solved it in a similar way.
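>>>>
>>>> As a rough illustration of why this is a pair-wise rather than a
>>>> per-node property (the helper names here are hypothetical):
>>>>
>>>>   // A cmp/test followed directly by a jcc may macro-fuse into a
>>>>   // single uop; the erratum bites when that fused pair crosses or
>>>>   // ends on a 32-byte boundary, so the decision inherently looks at
>>>>   // two adjacent nodes, not at one node in isolation.
>>>>   bool needs_jcc_erratum_padding(const MachNode* node,
>>>>                                  const MachNode* next) {
>>>>     return is_fuseable_cmp_like(node) && is_jcc_like(next);
>>>>   }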
>>>>
>>>> Thanks,
>>>> /Erik
>>>>
>>>>> Best regards,
>>>>> Vladimir Ivanov
>>>>>
>>>>>> On 11/25/19 4:31 PM, Vladimir Ivanov wrote:
>>>>>>> Hi Erik,
>>>>>>>
>>>>>>>>> But I'd include stubs as well. Many of them are extensively
>>>>>>>>> used from C2-generated code.
>>>>>>>>
>>>>>>>> Okay. Any specific stubs you have in mind? If there are some
>>>>>>>> critical ones, we can sprinkle some scope objects like I did in
>>>>>>>> the ZGC code.
>>>>>>>
>>>>>>> There are intrinsics for compressed strings [1], numerous copy
>>>>>>> stubs [2], trigonometric functions [3].
>>>>>>>
>>>>>>> It would be unfortunate if we had to go over all that code and
>>>>>>> manually instrument all the places where problematic instructions
>>>>>>> are issued. Moreover, the process would have to be repeated for
>>>>>>> new code being added over time.
>>>>>>>
>>>>>>>> I do have concerns, though, about injecting magic into the
>>>>>>>> MacroAssembler that tries to solve this automagically on the
>>>>>>>> assembly level, by having the assembler spit out different
>>>>>>>> instructions than you requested. The following comment from
>>>>>>>> assembler.hpp captures my thoughts exactly:
>>>>>>>>
>>>>>>>> 207: // The Abstract Assembler: Pure assembler doing NO
>>>>>>>> optimizations on the
>>>>>>>> 208: // instruction level; i.e., what you write is what you get.
>>>>>>>> 209: // The Assembler is generating code into a CodeBuffer.
>>>>>>>
>>>>>>> While I see that Assembler follows that (one instruction per
>>>>>>> method), MacroAssembler does not: there are cases where the
>>>>>>> generated code differs depending on runtime flags (e.g.,
>>>>>>> verification code) or input values (e.g., whether an
>>>>>>> AddressLiteral is reachable or not).
>>>>>>>
>>>>>>>> I think it is desirable to keep the property that when we tell
>>>>>>>> the *Assembler to generate a __ cmp(); __ jcc(); it will do
>>>>>>>> exactly that. When such assumptions break, any code that makes
>>>>>>>> assumptions about instruction sizes will fail. For example, any
>>>>>>>> MachNode with a hardcoded size() might underestimate how much
>>>>>>>> memory is really needed, and code such as the nmethod entry
>>>>>>>> barriers, which have calculated the offset to the cmp immediate,
>>>>>>>> might suddenly stop working. There is similar code where we
>>>>>>>> calculate offsets into mach nodes with oop maps to describe the
>>>>>>>> PC after a call, which would also stop working:
>>>>>>>>
>>>>>>>> // !!!!! Special hack to get all types of calls to specify the
>>>>>>>> //       byte offset from the start of the call to the point
>>>>>>>> //       where the return address will point.
>>>>>>>> int MachCallStaticJavaNode::ret_addr_offset()
>>>>>>>> {
>>>>>>>>   int offset = 5; // 5 bytes from start of call to where return
>>>>>>>>                   // address points
>>>>>>>>   offset += clear_avx_size();
>>>>>>>>   return offset;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Basically, I think you might be able to mitigate more branches
>>>>>>>> on the MacroAssembler layer, but I think it would also be more
>>>>>>>> risky, as code that was not built for having a random size will
>>>>>>>> start failing, in places we didn't think of. I can think of a
>>>>>>>> few, and feel like there are probably other places I have not
>>>>>>>> thought about.
>>>>>>>>
>>>>>>>> So from that point of view, I think I would rather do this on
>>>>>>>> Mach nodes where it is safe, and I think we can catch the most
>>>>>>>> important ones there, and miss a few branches that the macro
>>>>>>>> assembler would have found with magic, than apply it to all
>>>>>>>> branches and hope we find all the bugs due to unexpected magic.
>>>>>>>>
>>>>>>>> Do you agree? Or perhaps I misunderstood what you are suggesting.
>>>>>>>
>>>>>>> You raise a valid point: there are places in the VM which rely on
>>>>>>> hard-coded instruction sequences. If such an instruction changes,
>>>>>>> all relevant places have to be adjusted. And the JVM is already
>>>>>>> very cautious about such cases.
>>>>>>>
>>>>>>> I agree with you that the MacroAssembler-based approach is more
>>>>>>> risky, but IMO the risk is modest (few places are affected) and
>>>>>>> manageable (a dedicated stress mode should greatly improve test
>>>>>>> effectiveness).
>>>>>>>
>>>>>>> My opinion is that if we are satisfied with the coverage that the
>>>>>>> C2 CFG instrumentation provides and don't expect any more work on
>>>>>>> mitigations, then there's no motivation to invest in a
>>>>>>> MacroAssembler-based approach.
>>>>>>>
>>>>>>> Otherwise, there are basically 2 options:
>>>>>>>
>>>>>>>     * "opt-in": explicitly mark all the places where mitigations
>>>>>>>       are applied; by default, nothing is mitigated
>>>>>>>
>>>>>>>     * "opt-out": mitigate everything unless mitigations are
>>>>>>>       explicitly disabled
>>>>>>>
>>>>>>> Both approaches provide fine-grained control over what's being
>>>>>>> mitigated, but with "opt-out" there's more code to care about:
>>>>>>> it's easy to miss important cases and too tempting to enable more
>>>>>>> than we are 100% certain about.
>>>>>>>
>>>>>>> Both can be applied to individual CFG nodes and make CFG
>>>>>>> instrumentation redundant.
>>>>>>>
>>>>>>> But if there's a need to instrument large portions of
>>>>>>> (macro)assembly code, then IMO opt-in adds too much in terms of
>>>>>>> work required, noise (on the code level), maintenance, and burden
>>>>>>> for future code changes. So I don't consider it a feasible option
>>>>>>> in such a situation.
>>>>>>>
>>>>>>> It looks like a mixture of opt-in (explicitly enable in some
>>>>>>> context: in C2 during code emission, during particular stub
>>>>>>> generation, etc.) and opt-out (on the level of individual
>>>>>>> instructions) gives the best of both approaches.
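>>>>>>>
>>>>>>> A minimal sketch of such a mixture (all names are hypothetical,
>>>>>>> not an existing HotSpot API):
>>>>>>>
>>>>>>>   // RAII guard: mitigation is active while a scope constructed
>>>>>>>   // with 'true' lives (opt-in per context); a nested scope
>>>>>>>   // constructed with 'false' suppresses it around size-sensitive
>>>>>>>   // instruction sequences (opt-out per instruction).
>>>>>>>   class JccErratumScope {
>>>>>>>     MacroAssembler* const _masm;
>>>>>>>     const bool _saved;
>>>>>>>    public:
>>>>>>>     JccErratumScope(MacroAssembler* masm, bool enable)
>>>>>>>       : _masm(masm), _saved(masm->mitigate_jcc_erratum()) {
>>>>>>>       _masm->set_mitigate_jcc_erratum(enable);
>>>>>>>     }
>>>>>>>     ~JccErratumScope() {
>>>>>>>       _masm->set_mitigate_jcc_erratum(_saved);
>>>>>>>     }
>>>>>>>   };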
>>>>>>>
>>>>>>> But, again, if C2 CFG instrumentation is good enough, then it'll
>>>>>>> be a wasted effort.
>>>>>>>
>>>>>>> So, I envision 3 possible scenarios:
>>>>>>>
>>>>>>> (1) just instrument Mach IR and be done with it;
>>>>>>>
>>>>>>> (2) (a) start with Mach IR;
>>>>>>>     (b) later it turns out that extensive portions of
>>>>>>>         (macro)assembly code have to be instrumented (or, for
>>>>>>>         example, C1/Interpreter);
>>>>>>>     (c) implement MacroAssembler mitigations
>>>>>>>
>>>>>>> (3) start with MacroAssembler mitigations and be done with it
>>>>>>>     * doesn't preclude a gradual rollout across different subsystems
>>>>>>>
>>>>>>> Mach IR instrumentation (#1/#2) is the safest variant, but it may
>>>>>>> require more work.
>>>>>>>
>>>>>>> #3 is broadly applicable, but also riskier.
>>>>>>>
>>>>>>> What I don't consider a viable option is C2 CFG
>>>>>>> instrumentation accompanied by numerous per-instruction
>>>>>>> mitigations scattered across the code base.
>>>>>>>
>>>>>>>>>> I have made a prototype of what this might look like:
>>>>>>>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>>>>>>>
>>>>>>>>> Just one more comment: it's weird to see intel_jcc_erratum
>>>>>>>>> referenced in shared code. You could #ifdef it for x86 only,
>>>>>>>>> but it's much better to move the code to an x86-specific location.
>>>>>>>>
>>>>>>>> Sure, I can move that to an x86 file and make it build only on
>>>>>>>> x86_64.
>>>>>>>
>>>>>>> Yes, sounds good. But let's agree on the general direction first.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Vladimir Ivanov
>>>>>>>
>>>>>>> [1]
>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l1666
>>>>>>>
>>>>>>>
>>>>>>> [2]
>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/623722a6aeb9/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
>>>>>>>
>>>>>>>
>>>>>>> [3]
>>>>>>> http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86_(sin|cos|...).cpp
>>>>>>>
>>>>>>
>>>>
>>