8234160: ZGC: Enable optimized mitigation for Intel jcc erratum in C2 load barrier

Wed Feb 19 16:01:17 UTC 2020

Hi Tobias,

On 2/19/20 9:50 AM, Tobias Hartmann wrote:
> Hi Erik,
>
> thanks a lot for all the effort you've put into this.
>
> Webrev.01 looks good to me. Some minor comments:
> - Copyright needs to be updated
> - vm_version_x86.hpp:895 "crosses or ends" -> "cross or end"
> - c2_intelJccErratum_x86.cpp:51 Please add parentheses

Thanks for the review. I applied your suggested polishing, and the new 
webrev is here:
http://cr.openjdk.java.net/~eosterlund/8234160/webrev.02/

> I think we should also file a follow-up bug/enhancement to keep track of possible mitigations to
> other code.

Sure. Filed this one:
https://bugs.openjdk.java.net/browse/JDK-8239472

> Did you evaluate if Intel's mitigation has a performance impact on these highly used
> intrinsics/stubs mentioned by Vladimir?

Eric Caspole helped me with the evaluation (he ran a *lot* of stuff. I'm 
not sure if micro benchmarks for those
specific stubs/intrinsics were run. I only know that among all the 
benchmarks he ran (most of them I think actually
do something meaningful), the mitigation was effective at removing the 
throughput overheads that the micro code introduced.
There were only a few examples (apart from startup-benchmarks that are 
out of scope for this change) like some
XML sub-benchmark of SPECjvm2008 where a few percent were lost,comparing 
baseline (no micro code, without my patch)
to the mode of having the micro code and my patch. Butnotably, the same 
regression existed comparing the baseline
to a machine without the micro code and my patch.

So it seems like the act of nop sprinkling has bloated the code cache a 
bit in a way that cost a few percent
in that benchmark, as opposed to there being a sign such regression is 
due to missing out important branches in
stubs/C2 intrinsics.Is it possible that some stubs are left suboptimal? 
Yes. But not what we have seen stick out
in results in practice. I'm sure you can construct a program that turns 
this into a problem, but the aim of this
change is to just fix "most of the problems" occurring in practice in a 
way that is as unintrusive and correct as possible.

> Could you also change the title of the bug to be non-ZGC specific?

Sure.Changed it to: Enable optimized mitigation for Intel jcc erratum in C2

Thanks,
/Erik

> Best regards,
> Tobias
>
> On 14.02.20 17:46, erik.osterlund at oracle.com wrote:
>> Hi everyone,
>>
>> Eric Caspole helped me evaluate my solution thoroughly. Thanks a lot for that. The results were good.
>> In practice, my patch mitigated almost all of the regressions in C2 code caused by the JCC erratum
>> microcode mitigation.
>> There were a few cases where the nop sprinkling did hurt a few percent, possibly due to instruction
>> cache pollution.
>>
>> Because of the positive outcome of the experiments, I would like to propose we go with my
>> opt-in & safety-first based solution that I already posted earlier.
>>
>> I have talked to Vladimir, and I think we at this point in time agree about doing this.
>>
>> As a reminder, here is my webrev that I proposed (+ rebase that I just made):
>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>
>> Thanks,
>> /Erik
>>
>> On 11/25/19 4:31 PM, Vladimir Ivanov wrote:
>>> Hi Erik,
>>>
>>>>> But I'd include stubs as well. Many of them are extensively used from C2-generated code.
>>>> Okay. Any specific stubs you have in mind?If there are some critical ones, we can sprinkle some
>>>> scope objects like I did in the ZGC code.
>>> There are intrinsics for compressed strings [1], numerous copy stubs [2], trigonometric functions
>>> [3].
>>>
>>> It would be unfortunate if we have to go over all that code and manually instrument all the places
>>> where problematic instructions are issued. Moreover, the process has to be repeated for new code
>>> being added over time.
>>>
>>>> I do have concerns though about injecting magic into the MacroAssembler that tries to solve this
>>>> automagically on the assembly level, by having the assembler spit out different
>>>> instructions than you requested.
>>>> The following comment from assembler.hpp captures my thought exactly:
>>>>
>>>> 207: // The Abstract Assembler: Pure assembler doing NO optimizations on the
>>>> 208: // instruction level; i.e., what you write is what you get.
>>>> 209: // The Assembler is generating code into a CodeBuffer.
>>> While I see that Assembler follows that (instruction per method), MacroAssembler does not: there
>>> are cases when generated code differ depending on runtime flags (e.g., verification code) or input
>>> values (e.g., whether AddressLiteral is reachable or not).
>>>
>>>> I think it is desirable to keep the property that when we tell the *Assembler to generate a __
>>>> cmp(); __ jcc(); it will do exactly that.
>>>> When such assumptions break, any code that has calculated the size of instructions, making
>>>> assumptions about their size, will fail.
>>>> For example, any MachNode with hardcoded size() might underestimate how much memory is really
>>>> needed, and code such as nmethod entry barriers
>>>> that have calculated the offset to the cmp immediate might suddenly stop working because. There
>>>> is similar code for oop maps where we
>>>> calculate offsets into mach nodes with oop maps to describe the PC after a call, which will stop
>>>> working:
>>>>
>>>> // !!!!! Special hack to get all types of calls to specify the byte offset
>>>> //       from the start of the call to the point where the return address
>>>> //       will point.
>>>> int MachCallStaticJavaNode::ret_addr_offset()
>>>> {
>>>>     int offset = 5; // 5 bytes from start of call to where return address points
>>>>     offset += clear_avx_size();
>>>>     return offset;
>>>> }
>>>>
>>>> Basically, I think you might be able to mitigate more branches on the MacroAssembler layer, but I
>>>> think it would also be more risky, as code that was
>>>> not built for having random size will start failing, in places we didn't think of.I can think of
>>>> a few, and feel like there are probably other places I have not thought about.
>>>>
>>>> So from that point of view, I think I would rather to this on Mach nodes where it is safe, and I
>>>> think we can catch the most important ones there,
>>>> and miss a few branches that the macro assembler would have found with magic, than apply it to
>>>> all branches and hope we find all the bugs due to unexpected magic.
>>>>
>>>> Do you agree? Or perhaps I misunderstood what you are suggesting.
>>> You raise a valid point: there are places in the VM which rely on hard-coded instruction
>>> sequences. If such instruction changes, all relevant places have to be adjusted. And JVM is
>>> already very cautious about such cases.
>>>
>>> I agree with you that MacroAssembler-based more risky, but IMO the risk is modest (few places are
>>> affected) and manageable (dedicated stress mode should greatly improve test effectiveness).
>>>
>>> My opinion is that if we are satisfied with the coverage C2 CFG instrumentation provides and don't
>>> expect any more work on mitigations, then there's no motivation in investing into
>>> MacroAssembler-based approach.
>>>
>>> Otherwise, there are basically 2 options:
>>>
>>>    * "opt-in": explicitly mark all the places where mitigations are applied, by default nothing is
>>> mitigated
>>>
>>>    * "opt-out": mitigate everything unless mitigations are explicitly disabled
>>>
>>> Both approaches provide fine-grained control over what's being mitigated, but with "opt-out"
>>> there's more code to care about: it's easy to miss important cases and too tempting to enable more
>>> than we are 100% certain about.
>>>
>>> Both can be applied to individual CFG nodes and make CFG instrumentation redundant.
>>>
>>> But if there's a need to instrument large portions of (macro)assembly code, then IMO opt-in adds
>>> too much in terms of work required, noise (on code level), maintenance, and burden for future code
>>> changes. So, I don't consider it as a feasible option in such situation.
>>>
>>> It looks like a mixture of opt-in (explicitly enable in some context: in C2 during code emission,
>>> particular stub generation, etc) and opt-out (on the level of individual instructions) gives the
>>> best of both approaches.
>>>
>>> But, again, if C2 CFG instrumentation is good enough, then it'll be a wasted effort.
>>>
>>> So, I envision 3 possible scenarios:
>>>
>>>    (1) just instrument Mach IR and be done with it;
>>>
>>>    (2) (a) start with Mach IR;
>>>        (b) later it turns out that extensive portions of (macro)assembly code have to me
>>> instrumented (or, for example, C1/Interpreter)
>>>        (c) implement MacroAssembler mitigations
>>>
>>>    (3) start with MacroAssembler mitigations and be done with it
>>>       * doesn't perclude gradual roll out across different subsystems
>>>
>>> Mach IR instrumentation (#1/#2) is the safest variant, but it may require more work.
>>>
>>> #3 is broadly applicable, but also riskier.
>>>
>>> What I don't consider as a viable option is C2 CFG instrumentation accompanied by numerous
>>> per-instruction mitigations scattered across the code base.
>>>
>>>>>> I have made a prototype, what this might look like and it looks like this:
>>>>>> http://cr.openjdk.java.net/~eosterlund/8234160/webrev.01/
>>>>> Just one more comment: it's weird to see intel_jcc_erratum referenced in shared code. You could
>>>>> #ifdef it for x86-only, but it's much better to move the code to x86-specific location.
>>>> Sure, I can move that to an x86 file and make it build only on x86_64.
>>> Yes, sounds good. But let's agree on general direction first.
>>>
>>> Best regards,
>>> Vladimir Ivanov
>>>
>>> [1] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/macroAssembler_x86.hpp#l1666
>>>
>>> [2] http://hg.openjdk.java.net/jdk/jdk/file/623722a6aeb9/src/hotspot/cpu/x86/stubGenerator_x86_64.cpp
>>>
>>> [3] http://hg.openjdk.java.net/jdk/jdk/file/tip/src/hotspot/cpu/x86/
>>>      macroAssembler_x86_(sin|cos|...).cpp
>>>