RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Tue Oct 8 13:37:06 UTC 2019

For the non-Aarch64 parts: Looks good!

Reviewed.

// Nils

On 2019-10-08 15:30, Per Liden wrote:
> Looks good!
>
> /Per
>
> On 10/8/19 2:10 PM, Nils Eliasson wrote:
>> Updated webrev:
>>
>> http://cr.openjdk.java.net/~neliasso/superlate/webrev.05/
>>
>> Includes some very minor fixes to make builds pass on all platforms.
>>
>> // Nils
>>
>>
>> On 2019-10-08 09:20, Nils Eliasson wrote:
>>> Hi all,
>>>
>>> Here is the latest patch 
>>> http://cr.openjdk.java.net/~neliasso/superlate/webrev.04/
>>>
>>> This is the full implementation, stable on x64 and aarch64.
>>>
>>> // Nils
>>>
>>>
>>> On 2019-09-26 16:56, Nils Eliasson wrote:
>>>> Hi,
>>>>
>>>> Now I have updated the patch in place: 
>>>> http://cr.openjdk.java.net/~neliasso/aarch_barriers/aarch_super_late
>>>>
>>>> Status:
>>>>
>>>> Quite stable - hits a problem with monitor_enter_complete - 
>>>> probably a bug in cas/cmpx. I'm looking into it
>>>>
>>>> Use optimized spilling with -XX:+UseNewCode2 - a little less stable.
>>>>
>>>> Regards,
>>>>
>>>> Nils
>>>>
>>>>
>>>> On 2019-09-26 16:26, Stuart Monteith wrote:
>>>>> Hello,
>>>>>     Yes, I'll try this out - I was in the middle of my own
>>>>> implementation, but I haven't advanced it as far as Nils.
>>>>>
>>>>> BR,
>>>>>   Stuart
>>>>>
>>>>> On Sat, 21 Sep 2019 at 09:11, Per Liden <per.liden at oracle.com> wrote:
>>>>>> Thanks Nils!
>>>>>>
>>>>>> Stuart, could you take this for a spin and help Nils track down 
>>>>>> the last
>>>>>> few issues?
>>>>>>
>>>>>> We're eager to push this as soon as we feel comfortable that the 
>>>>>> aarch64
>>>>>> bits are good enough.
>>>>>>
>>>>>> cheers,
>>>>>> Per
>>>>>>
>>>>>> On 9/20/19 4:50 PM, Nils Eliasson wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is an attempt at porting the redesigned C2 barriers to 
>>>>>>> aarch64.
>>>>>>>
>>>>>>> Status
>>>>>>>
>>>>>>> * Mostly complete - ad-files should be complete although not fully
>>>>>>> tested yet
>>>>>>>
>>>>>>> * Spilling is implemented but not optimized
>>>>>>>
>>>>>>> * Runs a bit, but far from stable.
>>>>>>>
>>>>>>>
>>>>>>> Patch: http://cr.openjdk.java.net/~neliasso/aarch_barriers/
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Nils
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> For many years we have expanded load barriers in the C2 sea of 
>>>>>>>> nodes.
>>>>>>>> It has been a constant struggle to keep up with bugs due to
>>>>>>>> optimizations breaking our barriers. It has never truly worked. 
>>>>>>>> One
>>>>>>>> particular pain point that has never been handled quite right 
>>>>>>>> up until
>>>>>>>> now, is dealing with safepoints ending up between a load and 
>>>>>>>> its load
>>>>>>>> barrier. We have had workarounds for that before, but they have 
>>>>>>>> never
>>>>>>>> really been enough.
>>>>>>>>
>>>>>>>> In the end, our barrier is only a conditional branch to a slow 
>>>>>>>> path,
>>>>>>>> so there is really not much that the optimizer can do to help 
>>>>>>>> us make
>>>>>>>> that better. But it has many creative ways of breaking our GC 
>>>>>>>> invariants.
>>>>>>>>
>>>>>>>> I think we have finally had enough of this, and want to move the
>>>>>>>> barrier expansion to the MachNode level instead. This way, we can
>>>>>>>> finally put an end to the load and its load barrier being 
>>>>>>>> separated
>>>>>>>> (and related even more complicated issues for atomics).
>>>>>>>>
>>>>>>>> Our new solution is to tag accesses that want load barriers during
>>>>>>>> parsing, and then let C2 optimize whatever it wants to, 
>>>>>>>> independently of
>>>>>>>> GC barriers. Then it will match mach nodes, and perform global 
>>>>>>>> code
>>>>>>>> motion and scheduling. Only right before dumping the machine 
>>>>>>>> code of
>>>>>>>> the resulting graph do we call into the barrier set to perform 
>>>>>>>> last
>>>>>>>> second analysis of barriers, and then during machine code 
>>>>>>>> dumping, we
>>>>>>>> inject our load barriers. After the normal insts() are dumped, we
>>>>>>>> inject slow path stubs for our barriers.
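
The emission order described above can be sketched roughly as follows. This is a minimal model, not HotSpot code: MachInst, emit(), and the instruction strings are hypothetical names chosen for illustration, and the barrier check is reduced to a single pseudo-instruction. It only shows that the check is emitted directly after a tagged load and that the slow path stubs are appended after the normal instructions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of an already scheduled machine instruction.
struct MachInst {
    std::string name;
    bool needs_load_barrier;  // assumed representation of the parse-time tag
};

// Emits the normal instruction stream first, then the slow path stubs.
std::vector<std::string> emit(const std::vector<MachInst>& scheduled) {
    std::vector<std::string> code;
    std::vector<std::string> stubs;
    for (const MachInst& inst : scheduled) {
        code.push_back(inst.name);
        if (inst.needs_load_barrier) {
            // The check is emitted together with the load, so nothing
            // (in particular no safepoint) can end up between the two.
            const std::string label = "stub" + std::to_string(stubs.size());
            code.push_back("test_bad_mask_and_branch " + label);
            stubs.push_back(label + ": call_slow_path");
        }
    }
    // Slow path stubs are placed after the normal instructions.
    code.insert(code.end(), stubs.begin(), stubs.end());
    return code;
}
```

Because the optimizer never sees the barrier as separate nodes in this scheme, there is nothing for it to reorder or split apart.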
>>>>>>>>
>>>>>>>> There are two optimizations that we would like to retain in 
>>>>>>>> this scheme.
>>>>>>>>
>>>>>>>> Optimization 1: Dominating barrier analysis
>>>>>>>> Previously, we instantiated a PhaseIdealLoop instance to analyze
>>>>>>>> dominating barriers. That was convenient because a dominator 
>>>>>>>> tree is
>>>>>>>> available for finding load barriers that dominate other load 
>>>>>>>> barriers
>>>>>>>> in the CFG. I built a new more precise analysis on the PhaseCFG 
>>>>>>>> level
>>>>>>>> instead, happening after the matching to mach nodes. The analysis
>>>>>>>> now looks for dominating accesses instead of dominating load
>>>>>>>> barriers, because any dominating access, including stores, will make
>>>>>>>> sure that what is left behind in memory is "good". Another thing that
>>>>>>>> makes the analysis more precise, is that it doesn't require strict
>>>>>>>> dominance in the CFG. If the earlier access is in the same 
>>>>>>>> Block as an
>>>>>>>> access with barriers, we now also utilize knowledge about the
>>>>>>>> scheduling of the instructions, which has completed at this 
>>>>>>>> point. So
>>>>>>>> we can safely remove such pointless load barriers in the same 
>>>>>>>> block
>>>>>>>> now. The analysis is performed right before machine code is 
>>>>>>>> emitted,
>>>>>>>> so we can trust that it won't change after the analysis due to
>>>>>>>> optimizations.
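
The same-block case of the analysis above might look something like the sketch below. It is an illustration under stated assumptions, not the code in the webrev: Access, Kind, and elide_dominated_barriers() are hypothetical names, an integer stands in for "same base and offset", and the sketch assumes that a safepoint between two accesses invalidates what is known about memory. Cross-block dominance is left out entirely.

```cpp
#include <cassert>
#include <set>
#include <vector>

enum class Kind { Load, Store, Safepoint };

// Hypothetical model of one instruction in a block, in final schedule order.
struct Access {
    Kind kind;
    int address;         // assumed id for "same base and offset"; unused for safepoints
    bool needs_barrier;  // parse-time tag; only meaningful for loads
};

// Clears the barrier tag on loads that follow an earlier access to the same
// address in the same block. Returns the number of barriers elided.
int elide_dominated_barriers(std::vector<Access>& block) {
    std::set<int> known_good;
    int elided = 0;
    for (Access& a : block) {
        if (a.kind == Kind::Safepoint) {
            // Assumption of this sketch: a safepoint invalidates earlier knowledge.
            known_good.clear();
            continue;
        }
        if (a.kind == Kind::Load && a.needs_barrier && known_good.count(a.address) > 0) {
            a.needs_barrier = false;
            ++elided;
        }
        // Any access, load or store, leaves a "good" value behind in memory.
        known_good.insert(a.address);
    }
    return elided;
}
```

Walking the block in schedule order is only sound because scheduling has already completed at this point, which is the property the paragraph above relies on.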
>>>>>>>>
>>>>>>>> Optimization 2: Tight register spilling
>>>>>>>> When we call the slow path of our barriers, we want to spill 
>>>>>>>> only the
>>>>>>>> registers that are live. Previously, we had a special
>>>>>>>> LoadBarrierSlowReg node corresponding to the slow path, that 
>>>>>>>> killed
>>>>>>>> all XMM registers, and then all general purpose registers were 
>>>>>>>> saved
>>>>>>>> in the slow path. Now we instead perform explicit live analysis 
>>>>>>>> of our
>>>>>>>> registers on MachNodes, including how large chunks of vector 
>>>>>>>> registers
>>>>>>>> are being used, and spill only exactly the registers that are live
>>>>>>>> (and only the part of the register that is live for XMM/YMM/ZMM
>>>>>>>> registers).
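
As a rough illustration of the live analysis above, the sketch below does a backward liveness walk over one block and records, for each instruction with a barrier, which registers would have to be preserved around the slow path call. RegInst and spill_masks() are hypothetical names, registers are modelled as bits in a 32-bit mask, and partial liveness of vector registers (XMM/YMM/ZMM widths) is not modelled at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a scheduled instruction; each bit is one register.
struct RegInst {
    std::uint32_t defs;  // registers written by the instruction
    std::uint32_t uses;  // registers read by the instruction
    bool has_barrier;    // true if a load barrier is attached
};

// For every instruction with a barrier, computes the registers that are live
// across it. live_out is the set of registers live at the end of the block.
std::vector<std::uint32_t> spill_masks(const std::vector<RegInst>& block,
                                       std::uint32_t live_out) {
    std::vector<std::uint32_t> masks(block.size(), 0);
    std::uint32_t live = live_out;
    for (std::size_t i = block.size(); i-- > 0;) {
        const RegInst& inst = block[i];
        if (inst.has_barrier) {
            // Live after the instruction, minus what it defines itself:
            // the loaded value is produced by the barrier, not preserved by it.
            masks[i] = live & ~inst.defs;
        }
        // Standard backward transfer: live-in = (live-out - defs) + uses.
        live = (live & ~inst.defs) | inst.uses;
    }
    return masks;
}
```

A register that is dead after the barrier gets no bit in the mask, so the stub would not need to save it.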
>>>>>>>>
>>>>>>>> Zooming out a bit, all complexity of pulling the barriers in 
>>>>>>>> the sea
>>>>>>>> of nodes through various interesting phases while retaining GC
>>>>>>>> invariants such as "don't put safepoints in my barrier", becomes
>>>>>>>> trivial and no longer an issue. We simply tag our loads to need
>>>>>>>> barriers, and let C2 do whatever it wants to in the sea of 
>>>>>>>> nodes. Once
>>>>>>>> all scheduling is done, we do our thing. Hopefully this will 
>>>>>>>> make the
>>>>>>>> barriers as stable and resilient as our C1 barriers, which cause
>>>>>>>> trouble extremely rarely.
>>>>>>>>
>>>>>>>> We have run a number of benchmarks. We have observed a number of
>>>>>>>> improvements, but never any regressions. There have been 
>>>>>>>> countless runs
>>>>>>>> through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>>>>>>>>
>>>>>>>> Finally, I would like to thank Per and StefanK for the many hours
>>>>>>>> spent on helping me with this patch, both in terms of spotting 
>>>>>>>> flaws
>>>>>>>> in my prototypes, benchmarking, testing, and refactoring so the 
>>>>>>>> code
>>>>>>>> looks nice and much more understandable. I will add both to the
>>>>>>>> Contributed-by line.
>>>>>>>>
>>>>>>>> @Stuart: It would be awesome if you could provide some AArch64 
>>>>>>>> bits
>>>>>>>> for this patch so we do things the same way (ish).
>>>>>>>>
>>>>>>>> Bug:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>>>>
>>>>>>>> Webrev:
>>>>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> /Erik