RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Tue Oct 8 07:20:44 UTC 2019

Hi all,

Here is the latest patch 
http://cr.openjdk.java.net/~neliasso/superlate/webrev.04/

This is the full implementation, stable on x64 and aarch64.

// Nils

On 2019-09-26 16:56, Nils Eliasson wrote:
> Hi,
>
> Now I have update the patch in place: 
> http://cr.openjdk.java.net/~neliasso/aarch_barriers/aarch_super_late
>
> Status:
>
> Quite stable - hits a problem with monitor_enter_complete - probably a 
> bug in cas/cmpx. I'm looking into it
>
> Use optimized spilling with -XX:+UseNewCode2 - a little less stable.
>
> Regards,
>
> Nils
>
>
> On 2019-09-26 16:26, Stuart Monteith wrote:
>> Hello,
>>     Yes, I'll try this out - I was in the middle of my own
>> implementation, but I haven't advanced it as far as Nils.
>>
>> BR,
>>   Stuart
>>
>> On Sat, 21 Sep 2019 at 09:11, Per Liden <per.liden at oracle.com> wrote:
>>> Thanks Nils!
>>>
>>> Stuart, could you take this for a spin and help Nils track down the 
>>> last
>>> few issues?
>>>
>>> We're eager to push this as soon as we feel comfortable that the 
>>> aarch64
>>> bits are good enough.
>>>
>>> cheers,
>>> Per
>>>
>>> On 9/20/19 4:50 PM, Nils Eliasson wrote:
>>>> Hi,
>>>>
>>>> This is an attempt of porting the redesigned C2 barriers to aarch64.
>>>>
>>>> Status
>>>>
>>>> * Mostly complete - ad-files should be complete although not fully
>>>> tested yet
>>>>
>>>> * Spilling is implemented but not optimized
>>>>
>>>> * Runs a bit, but far from stable.
>>>>
>>>>
>>>> Patch: http://cr.openjdk.java.net/~neliasso/aarch_barriers/
>>>>
>>>> Regards,
>>>>
>>>> Nils
>>>>
>>>>
>>>>
>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For many years we have expanded load barriers in the C2 sea of nodes.
>>>>> It has been a constant struggle to keep up with bugs due to
>>>>> optimizations breaking our barriers. It has never truly worked. One
>>>>> particular pain point that has never been handled quite right up 
>>>>> until
>>>>> now, is dealing with safepoints ending up between a load and its load
>>>>> barrier. We have had workarounds for that before, but they have never
>>>>> really been enough.
>>>>>
>>>>> In the end, our barrier is only a conditional branch to a slow path,
>>>>> so there is really not much that the optimizer can do to help us make
>>>>> that better. But it has many creative ways of breaking our GC 
>>>>> invariants.
>>>>>
>>>>> I think we have finally had enough of this, and want to move the
>>>>> barrier expansion to the MachNode level instead. This way, we can
>>>>> finally put an end to the load and its load barrier being separated
>>>>> (and related even more complicated issues for atomics).
>>>>>
>>>>> Our new solution is to tag accesses that want load barriers during
>>>>> parsing, and then let C2 optimize whatever it wants to, 
>>>>> invariantly of
>>>>> GC barriers. Then it will match mach nodes, and perform global code
>>>>> motion and scheduling. Only right before dumping the machine code of
>>>>> the resulting graph do we call into the barrier set to perform last
>>>>> second analysis of barriers, and then during machine code dumping, we
>>>>> inject our load barriers. After the normal insts() are dumped, we
>>>>> inject slow path stubs for our barriers.
>>>>>
>>>>> There are two optimizations that we would like to retain in this 
>>>>> scheme.
>>>>>
>>>>> Optimization 1: Dominating barrier analysis
>>>>> Previously, we instantiated a PhaseIdealLoop instance to analyze
>>>>> dominating barriers. That was convenient because a dominator tree is
>>>>> available for finding load barriers that dominate other load barriers
>>>>> in the CFG. I built a new more precise analysis on the PhaseCFG level
>>>>> instead, happening after the matching to mach nodes. The analysis is
>>>>> now looking for dominating accesses, instead of dominating load
>>>>> barriers. Because any dominating access, including stores, will make
>>>>> sure that what is left behind in memory is "good". Another thing that
>>>>> makes the analysis more precise, is that it doesn't require strict
>>>>> dominance in the CFG. If the earlier access is in the same Block 
>>>>> as an
>>>>> access with barriers, we now also utilize knowledge about the
>>>>> scheduling of the instructions, which has completed at this point. So
>>>>> we can safely remove such pointless load barriers in the same block
>>>>> now. The analysis is performed right before machine code is emitted,
>>>>> so we can trust that it won't change after the analysis due to
>>>>> optimizations.
>>>>>
>>>>> Optimization 2: Tight register spilling
>>>>> When we call the slow path of our barriers, we want to spill only the
>>>>> registers that are live. Previously, we had a special
>>>>> LoadBarrierSlowReg node corresponding to the slow path, that killed
>>>>> all XMM registers, and then all general purpose registers were called
>>>>> in the slow path. Now we instead perform explicit live analysis of 
>>>>> our
>>>>> registers on MachNodes, including how large chunks of vector 
>>>>> registers
>>>>> are being used, and spill only exactly the registers that are live
>>>>> (and only the part of the register that is live for XMM/YMM/ZMM
>>>>> registers).
>>>>>
>>>>> Zooming out a bit, all complexity of pulling the barriers in the sea
>>>>> of nodes through various interesting phases while retaining GC
>>>>> invariants such as "don't put safepoints in my barrier", become
>>>>> trivial and no longer an issue. We simply tag our loads to need
>>>>> barriers, and let C2 do whatever it wants to in the sea of nodes. 
>>>>> Once
>>>>> all scheduling is done, we do our thing. Hopefully this will make the
>>>>> barriers as stable and resilient as our C1 barriers, which cause
>>>>> trouble extremely rarely.
>>>>>
>>>>> We have run a number of benchmarks. We have observed a number of
>>>>> improvements, but never any regressions. There has been countless 
>>>>> runs
>>>>> through gc-test-suite, and a few hs-tier1-6 and his tier1-7 runs.
>>>>>
>>>>> Finally, I would like to thank Per and StefanK for the many hours
>>>>> spent on helping me with this patch, both in terms of spotting flaws
>>>>> in my prototypes, benchmarking, testing, and refactoring so the code
>>>>> looks nice and much more understandable. I will add both to the
>>>>> Contributed-by line.
>>>>>
>>>>> @Stuart: It would be awesome if you could provide some AArch64 bits
>>>>> for this patch so we do things the same way (ish).
>>>>>
>>>>> Bug:
>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>
>>>>> Webrev:
>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>
>>>>> Thanks,
>>>>> /Erik