8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier

Wed Feb 19 08:50:14 UTC 2020

Hi Erik,

thanks a lot for all the effort you've put into this.

Webrev.01 looks good to me. Some minor comments:
- Copyright needs to be updated
- vm_version_x86.hpp:895 "crosses or ends" -> "cross or end"
- c2_intelJccErratum_x86.cpp:51 Please add parentheses

I think we should also file a follow-up bug/enhancement to keep track of possible mitigations to
other code. Did you evaluate if Intel's mitigation has a performance impact on these highly used
intrinsics/stubs mentioned by Vladimir?

Could you also change the title of the bug to be non-ZGC specific?

Best regards,
Tobias

On 14.02.20 17:46, erik.osterlund at oracle.com wrote:
> Hi everyone,
> 
> Eric Caspole helped me evaluate my solution thoroughly. Thanks a lot for that. The results were good.
> In practice, my patch mitigated almost all of the regressions in C2 code caused by the JCC erratum
> microcode mitigation.
> There were a few cases where the nop sprinkling did hurt a few percent, possibly due to instruction
> cache pollution.
> 
> Because of the positive outcome of the experiments, I would like to propose we go with my
> opt-in & safety-first based solution that I already posted earlier.
> 
> I have talked to Vladimir, and I think we at this point in time agree about doing this.
> 
> As a reminder, here is my webrev that I proposed (+ rebase that I just made):
> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
> 
> Thanks,
> /Erik
> 
> On 11/25/19 4:31 PM, Vladimir Ivanov wrote:
>> Hi Erik,
>>
>>>> But I'd include stubs as well. Many of them are extensively used from C2-generated code.
>>>
>>> Okay. Any specific stubs you have in mind?If there are some critical ones, we can sprinkle some
>>> scope objects like I did in the ZGC code.
>>
>> There are intrinsics for compressed strings [1], numerous copy stubs [2], trigonometric functions
>> [3].
>>
>> It would be unfortunate if we have to go over all that code and manually instrument all the places
>> where problematic instructions are issued. Moreover, the process has to be repeated for new code
>> being added over time.
>>
>>> I do have concerns though about injecting magic into the MacroAssembler that tries to solve this
>>> automagically on the assembly level, by having the assembler spit out different
>>> instructions than you requested.
>>> The following comment from assembler.hpp captures my thought exactly:
>>>
>>> 207: // The Abstract Assembler: Pure assembler doing NO optimizations on the
>>> 208: // instruction level; i.e., what you write is what you get.
>>> 209: // The Assembler is generating code into a CodeBuffer.
>>
>> While I see that Assembler follows that (instruction per method), MacroAssembler does not: there
>> are cases when generated code differ depending on runtime flags (e.g., verification code) or input
>> values (e.g., whether AddressLiteral is reachable or not).
>>
>>> I think it is desirable to keep the property that when we tell the *Assembler to generate a __
>>> cmp(); __ jcc(); it will do exactly that.
>>> When such assumptions break, any code that has calculated the size of instructions, making
>>> assumptions about their size, will fail.
>>> For example, any MachNode with hardcoded size() might underestimate how much memory is really
>>> needed, and code such as nmethod entry barriers
>>> that have calculated the offset to the cmp immediate might suddenly stop working because. There
>>> is similar code for oop maps where we
>>> calculate offsets into mach nodes with oop maps to describe the PC after a call, which will stop
>>> working:
>>>
>>> // !!!!! Special hack to get all types of calls to specify the byte offset
>>> //       from the start of the call to the point where the return address
>>> //       will point.
>>> int MachCallStaticJavaNode::ret_addr_offset()
>>> {
>>>    int offset = 5; // 5 bytes from start of call to where return address points
>>>    offset += clear_avx_size();
>>>    return offset;
>>> }
>>>
>>> Basically, I think you might be able to mitigate more branches on the MacroAssembler layer, but I
>>> think it would also be more risky, as code that was
>>> not built for having random size will start failing, in places we didn't think of.I can think of
>>> a few, and feel like there are probably other places I have not thought about.
>>>
>>> So from that point of view, I think I would rather to this on Mach nodes where it is safe, and I
>>> think we can catch the most important ones there,
>>> and miss a few branches that the macro assembler would have found with magic, than apply it to
>>> all branches and hope we find all the bugs due to unexpected magic.
>>>
>>> Do you agree? Or perhaps I misunderstood what you are suggesting.
>>
>> You raise a valid point: there are places in the VM which rely on hard-coded instruction
>> sequences. If such instruction changes, all relevant places have to be adjusted. And JVM is
>> already very cautious about such cases.
>>
>> I agree with you that MacroAssembler-based more risky, but IMO the risk is modest (few places are
>> affected) and manageable (dedicated stress mode should greatly improve test effectiveness).
>>
>> My opinion is that if we are satisfied with the coverage C2 CFG instrumentation provides and don't
>> expect any more work on mitigations, then there's no motivation in investing into
>> MacroAssembler-based approach.
>>
>> Otherwise, there are basically 2 options:
>>
>>   * "opt-in": explicitly mark all the places where mitigations are applied, by default nothing is
>> mitigated
>>
>>   * "opt-out": mitigate everything unless mitigations are explicitly disabled
>>
>> Both approaches provide fine-grained control over what's being mitigated, but with "opt-out"
>> there's more code to care about: it's easy to miss important cases and too tempting to enable more
>> than we are 100% certain about.
>>
>> Both can be applied to individual CFG nodes and make CFG instrumentation redundant.
>>
>> But if there's a need to instrument large portions of (macro)assembly code, then IMO opt-in adds
>> too much in terms of work required, noise (on code level), maintenance, and burden for future code
>> changes. So, I don't consider it as a feasible option in such situation.
>>
>> It looks like a mixture of opt-in (explicitly enable in some context: in C2 during code emission,
>> particular stub generation, etc) and opt-out (on the level of individual instructions) gives the
>> best of both approaches.
>>
>> But, again, if C2 CFG instrumentation is good enough, then it'll be a wasted effort.
>>
>> So, I envision 3 possible scenarios:
>>
>>   (1) just instrument Mach IR and be done with it;
>>
>>   (2) (a) start with Mach IR;
>>       (b) later it turns out that extensive portions of (macro)assembly code have to me
>> instrumented (or, for example, C1/Interpreter)
>>       (c) implement MacroAssembler mitigations
>>
>>   (3) start with MacroAssembler mitigations and be done with it
>>      * doesn't perclude gradual roll out across different subsystems
>>
>> Mach IR instrumentation (#1/#2) is the safest variant, but it may require more work.
>>
>> #3 is broadly applicable, but also riskier.
>>
>> What I don't consider as a viable option is C2 CFG instrumentation accompanied by numerous
>> per-instruction mitigations scattered across the code base.
>>
>>>>> I have made a prototype, what this might look like and it looks like this:
>>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>>
>>>> Just one more comment: it's weird to see intel_jcc_erratum referenced in shared code. You could
>>>> #ifdef it for x86-only, but it's much better to move the code to x86-specific location.
>>>
>>> Sure, I can move that to an x86 file and make it build only on x86_64.
>>
>> Yes, sounds good. But let's agree on general direction first.
>>
>> Best regards,
>> Vladimir Ivanov
>>
>> [1] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l1666
>>
>> [2] http://hg.openjdk.java.net/jdk/jdk/file/623722a6aeb9/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
>>
>> [3] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/
>>     macroAssembler_x86_(sin|cos|...).cpp
>>
>