RFR: 8230565: ZGC: Redesign C2 load barrier to expand on the MachNode level

Tue Oct 8 13:30:49 UTC 2019

Looks good!

/Per

On 10/8/19 2:10 PM, Nils Eliasson wrote:
> Updated webrev:
> 
> http://cr.openjdk.java.net/~neliasso/superlate/webrev.05/
> 
> Includes some very minor fixes to make builds pass on all platforms.
> 
> // Nils
> 
> 
> On 2019-10-08 09:20, Nils Eliasson wrote:
>> Hi all,
>>
>> Here is the latest patch 
>> http://cr.openjdk.java.net/~neliasso/superlate/webrev.04/
>>
>> This is the full implementation, stable on x64 and aarch64.
>>
>> // Nils
>>
>>
>> On 2019-09-26 16:56, Nils Eliasson wrote:
>>> Hi,
>>>
>>> I have now updated the patch in place: 
>>> http://cr.openjdk.java.net/~neliasso/aarch_barriers/aarch_super_late
>>>
>>> Status:
>>>
>>> Quite stable, but it hits a problem with monitor_enter_complete, which
>>> is probably a bug in cas/cmpx. I'm looking into it.
>>>
>>> Use optimized spilling with -XX:+UseNewCode2 - a little less stable.
>>>
>>> Regards,
>>>
>>> Nils
>>>
>>>
>>> On 2019-09-26 16:26, Stuart Monteith wrote:
>>>> Hello,
>>>>     Yes, I'll try this out - I was in the middle of my own
>>>> implementation, but I haven't advanced it as far as Nils.
>>>>
>>>> BR,
>>>>   Stuart
>>>>
>>>> On Sat, 21 Sep 2019 at 09:11, Per Liden <per.liden at oracle.com> wrote:
>>>>> Thanks Nils!
>>>>>
>>>>> Stuart, could you take this for a spin and help Nils track down the
>>>>> last few issues?
>>>>>
>>>>> We're eager to push this as soon as we feel comfortable that the
>>>>> aarch64 bits are good enough.
>>>>>
>>>>> cheers,
>>>>> Per
>>>>>
>>>>> On 9/20/19 4:50 PM, Nils Eliasson wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This is an attempt at porting the redesigned C2 barriers to aarch64.
>>>>>>
>>>>>> Status
>>>>>>
>>>>>> * Mostly complete - ad-files should be complete although not fully
>>>>>> tested yet
>>>>>>
>>>>>> * Spilling is implemented but not optimized
>>>>>>
>>>>>> * Runs a bit, but far from stable.
>>>>>>
>>>>>>
>>>>>> Patch: http://cr.openjdk.java.net/~neliasso/aarch_barriers/
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Nils
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2019-09-04 14:58, Erik Österlund wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> For many years we have expanded load barriers in the C2 sea of nodes.
>>>>>>> It has been a constant struggle to keep up with bugs due to
>>>>>>> optimizations breaking our barriers. It has never truly worked. One
>>>>>>> particular pain point that has never been handled quite right up until
>>>>>>> now, is dealing with safepoints ending up between a load and its load
>>>>>>> barrier. We have had workarounds for that before, but they have never
>>>>>>> really been enough.
>>>>>>>
>>>>>>> In the end, our barrier is only a conditional branch to a slow path,
>>>>>>> so there is really not much that the optimizer can do to help us make
>>>>>>> that better. But it has many creative ways of breaking our GC
>>>>>>> invariants.
>>>>>>>
>>>>>>> I think we have finally had enough of this, and want to move the
>>>>>>> barrier expansion to the MachNode level instead. This way, we can
>>>>>>> finally put an end to the load and its load barrier being separated
>>>>>>> (and related even more complicated issues for atomics).
>>>>>>>
>>>>>>> Our new solution is to tag accesses that want load barriers during
>>>>>>> parsing, and then let C2 optimize whatever it wants to, independently
>>>>>>> of GC barriers. Then it will match mach nodes, and perform global code
>>>>>>> motion and scheduling. Only right before dumping the machine code of
>>>>>>> the resulting graph do we call into the barrier set to perform
>>>>>>> last-second analysis of barriers, and then during machine code dumping,
>>>>>>> we inject our load barriers. After the normal insts() are dumped, we
>>>>>>> inject slow path stubs for our barriers.
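The emission order described above (barriers injected while the main code is dumped, slow path stubs appended afterwards) can be sketched roughly as follows. This is a simplified model, not the patch's code: MachInst, Emitter, emit_all and the pseudo-assembly strings are all hypothetical names made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified model of a scheduled machine instruction.
struct MachInst {
  std::string text;    // pseudo-assembly, e.g. "load r1, [r2]"
  bool needs_barrier;  // tag set at parse time, carried through unchanged
};

// Hypothetical emitter: main code first, slow path stubs afterwards.
struct Emitter {
  std::vector<std::string> code;
  std::vector<std::string> pending_stubs;

  void emit(const MachInst& inst) {
    code.push_back(inst.text);
    if (inst.needs_barrier) {
      // The barrier check is emitted directly after its load, so nothing
      // (for example a safepoint) can end up between the two.
      std::string stub = "stub_" + std::to_string(pending_stubs.size());
      code.push_back("  test_bad_mask_and_branch " + stub);
      pending_stubs.push_back(stub);
    }
  }

  // Called once, after all regular instructions have been emitted.
  void emit_stubs() {
    for (const std::string& stub : pending_stubs) {
      code.push_back(stub + ": call slow_path; jump back");
    }
    pending_stubs.clear();
  }
};

// Emits every instruction in schedule order, then all out-of-line stubs.
std::vector<std::string> emit_all(const std::vector<MachInst>& insts) {
  Emitter e;
  for (const MachInst& inst : insts) {
    e.emit(inst);
  }
  e.emit_stubs();
  return e.code;
}
```

Calling emit_all on a schedule with one tagged load yields that load, its inline check, the remaining instructions, and finally a single stub at the end of the code.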
>>>>>>>
>>>>>>> There are two optimizations that we would like to retain in this
>>>>>>> scheme.
>>>>>>>
>>>>>>> Optimization 1: Dominating barrier analysis
>>>>>>> Previously, we instantiated a PhaseIdealLoop instance to analyze
>>>>>>> dominating barriers. That was convenient because a dominator tree is
>>>>>>> available for finding load barriers that dominate other load barriers
>>>>>>> in the CFG. I built a new, more precise analysis on the PhaseCFG level
>>>>>>> instead, happening after the matching to mach nodes. The analysis is
>>>>>>> now looking for dominating accesses, instead of dominating load
>>>>>>> barriers, because any dominating access, including stores, will make
>>>>>>> sure that what is left behind in memory is "good". Another thing that
>>>>>>> makes the analysis more precise is that it doesn't require strict
>>>>>>> dominance in the CFG. If the earlier access is in the same Block as an
>>>>>>> access with barriers, we now also utilize knowledge about the
>>>>>>> scheduling of the instructions, which has completed at this point. So
>>>>>>> we can now safely remove such pointless load barriers within a single
>>>>>>> block. The analysis is performed right before machine code is emitted,
>>>>>>> so we can trust that the result won't be invalidated by later
>>>>>>> optimizations.
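A minimal sketch of the dominating-access idea described above, under stated assumptions: Block, Access and the integer "location" id are hypothetical simplifications, every access is assumed to be an oop access, and the sketch ignores anything else the real analysis may have to consider (for instance what happens between the two accesses).

```cpp
#include <cassert>
#include <vector>

// Hypothetical CFG block: only the immediate dominator is modelled.
struct Block {
  int idom;  // index of the immediate dominator, -1 for the entry block
};

// Hypothetical oop access (load or store) after scheduling.
struct Access {
  int block;           // index of the block containing the access
  int index;           // position within the block after scheduling
  int location;        // made-up id for the accessed memory location
  bool needs_barrier;  // true if a load barrier is still required
};

// True if block a strictly dominates block b (walks b's idom chain).
bool strictly_dominates(const std::vector<Block>& cfg, int a, int b) {
  for (int cur = cfg[b].idom; cur != -1; cur = cfg[cur].idom) {
    if (cur == a) return true;
  }
  return false;
}

// True if 'earlier' is guaranteed to have executed before 'later'.
// Within one block the final schedule order is used, so strict
// dominance between blocks is not required.
bool executes_before(const std::vector<Block>& cfg,
                     const Access& earlier, const Access& later) {
  if (earlier.block == later.block) {
    return earlier.index < later.index;
  }
  return strictly_dominates(cfg, earlier.block, later.block);
}

// Clears needs_barrier on every access that is preceded by another
// access to the same location.
void elide_dominated_barriers(const std::vector<Block>& cfg,
                              std::vector<Access>& accesses) {
  for (Access& later : accesses) {
    if (!later.needs_barrier) continue;
    for (const Access& earlier : accesses) {
      if (&earlier == &later) continue;
      if (earlier.location == later.location &&
          executes_before(cfg, earlier, later)) {
        later.needs_barrier = false;
        break;
      }
    }
  }
}
```

The first access to a location always keeps its barrier, since nothing executes before it; later accesses to the same location lose theirs whether the earlier access sits in a dominating block or earlier in the same block.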
>>>>>>>
>>>>>>> Optimization 2: Tight register spilling
>>>>>>> When we call the slow path of our barriers, we want to spill only the
>>>>>>> registers that are live. Previously, we had a special
>>>>>>> LoadBarrierSlowReg node corresponding to the slow path, that killed
>>>>>>> all XMM registers, and then all general purpose registers were saved
>>>>>>> in the slow path. Now we instead perform explicit liveness analysis of
>>>>>>> our registers on MachNodes, including how large a chunk of each vector
>>>>>>> register is being used, and spill only exactly the registers that are
>>>>>>> live (and only the part of the register that is live for XMM/YMM/ZMM
>>>>>>> registers).
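The liveness idea described above can be illustrated with a backward scan over straight-line code. This is only a sketch under assumptions: Inst, RegUse and live_after are hypothetical names, registers are plain integers, a definition is assumed to overwrite the whole register, and control flow is ignored entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical register read: which register, and how many bytes of it.
struct RegUse {
  int reg;
  int bytes;
};

// Hypothetical scheduled instruction.
struct Inst {
  std::vector<int> defs;     // registers fully overwritten (assumption)
  std::vector<RegUse> uses;  // registers read, with the width read
};

// Returns the registers live immediately after insts[barrier_index],
// mapped to the widest number of bytes that a later instruction reads
// before the register is redefined. Only these would need spilling
// around a slow path call placed at that point.
std::map<int, int> live_after(const std::vector<Inst>& insts,
                              std::size_t barrier_index) {
  std::map<int, int> live;
  // Walk backwards from the last instruction down to barrier_index + 1.
  for (std::size_t i = insts.size(); i-- > barrier_index + 1; ) {
    const Inst& inst = insts[i];
    for (int d : inst.defs) {
      live.erase(d);  // value is produced here, so not live above
    }
    for (const RegUse& u : inst.uses) {
      int& bytes = live[u.reg];  // starts at 0 if not yet present
      bytes = std::max(bytes, u.bytes);
    }
  }
  return live;
}
```

If a vector register is read at 16 bytes and then redefined before a 32-byte read, only 16 bytes are reported as live at the barrier, which is the kind of partial spill the text describes for XMM/YMM/ZMM registers.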
>>>>>>>
>>>>>>> Zooming out a bit, all the complexity of pulling the barriers in the
>>>>>>> sea of nodes through various interesting phases while retaining GC
>>>>>>> invariants such as "don't put safepoints in my barrier" becomes
>>>>>>> trivial and is no longer an issue. We simply tag our loads as needing
>>>>>>> barriers, and let C2 do whatever it wants to in the sea of nodes. Once
>>>>>>> all scheduling is done, we do our thing. Hopefully this will make the
>>>>>>> barriers as stable and resilient as our C1 barriers, which cause
>>>>>>> trouble extremely rarely.
>>>>>>>
>>>>>>> We have run a number of benchmarks. We have observed a number of
>>>>>>> improvements, but never any regressions. There have been countless
>>>>>>> runs through gc-test-suite, and a few hs-tier1-6 and hs-tier1-7 runs.
>>>>>>>
>>>>>>> Finally, I would like to thank Per and StefanK for the many hours
>>>>>>> spent on helping me with this patch, both in terms of spotting flaws
>>>>>>> in my prototypes, benchmarking, testing, and refactoring so the code
>>>>>>> looks nice and much more understandable. I will add both to the
>>>>>>> Contributed-by line.
>>>>>>>
>>>>>>> @Stuart: It would be awesome if you could provide some AArch64 bits
>>>>>>> for this patch so we do things the same way (ish).
>>>>>>>
>>>>>>> Bug:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8230565
>>>>>>>
>>>>>>> Webrev:
>>>>>>> http://cr.openjdk.java.net/~eosterlund/8230565/webrev.00/
>>>>>>>
>>>>>>> Thanks,
>>>>>>> /Erik