RFR 8134802 - LCM register pressure scheduling

Wed Sep 16 01:36:48 UTC 2015

The parameter liveout will never be referenced in lower_pressure in LCM context, it is intentionally guarded for not entering when scheduling for register pressure.  I did in fact experiment around with different values for both float and int pressure thresholds.  The versions in changelist work best for x64 and x86.  Other machine targets will need to verify and tune as they enable this algorithm.

Thanks,
Michael

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com] 
Sent: Tuesday, September 15, 2015 4:14 PM
To: Berg, Michael C; Dean Long; hotspot-compiler-dev at openjdk.java.net
Subject: Re: RFR 8134802 - LCM register pressure scheduling

I don't want these changes affect normal RA. Currently they are isolated and it is good.
Improvement to RA code should be done separately.

 > Am I wrong thinking the register pressure units and limits in LCM should be the same as in chaitin.cpp?

Not necessary. The goal is different in LCM and RA.
In RA we try to find a concrete register assigned to LRG.
In LCM we try to find instructions sequence which lower registers pressure in this block.

Yes, from code support POV - to have less confusion, it would be nice to have them the same. But it may not enough to get best results.

I hope Michael did experiments with different approaches to get what he has now. I wish he can add more comments explaining his algorithm to avoid confusions.

One thing is bothering me is calls to lower_pressure() from LCM new code which passes 'liveout' parameter as NULL:

_regalloc->lower_pressure(block, 0, lrg_src, NULL, _regalloc->_sched_int_pressure, _regalloc->_sched_float_pressure);

but lower_pressure() use it in asserts and it may crash VM if it is NULL:

    assert(int_pressure.current_pressure() == count_int_pressure(liveout), "the int pressure is incorrect");

JPRT tests passed with this code so the only explanation I have is that this part of code was not executed when running tests but it is bug I think. Note, it is only problem in debug VM (which have asserts).

Thanks,
Vladimir

On 9/15/15 3:26 PM, Berg, Michael C wrote:
> I'll add some to the notes below.  My changes do extend FLOATPRESSURE which is also used in RA (with a new value for x64), however yes, I think that something like what I have could be integrated for finer granularity of pressure in RA as well.  I was thinking about it during the implementation, but wanted some time to test the change. I use the same liveness analysis RA uses, excepting that mine is still in SSA form. Also I notice that we do get the upper bank of register assignment via RA without the change when using AVX512.
>
> It pretty easy at this point to just apply the change to RA, however I will only be able to test x86 and x64.
> We will need a handshake set of tests for other targets if I do.
>
> Else we could wait...
>
> -Michael
>
> -----Original Message-----
> From: hotspot-compiler-dev 
> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of 
> Dean Long
> Sent: Tuesday, September 15, 2015 3:02 PM
> To: hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>
> Vladimir, my concerns about LCM are mostly because of my understanding of the register pressure code in chaitin.cpp.  For example:
>
>           case Op_VecY:
> assert(Matcher::vector_size_supported(T_FLOAT,RegMask::SlotsPerVecY),
> "sanity");
>             assert(RegMask::num_registers(Op_VecY) == RegMask::SlotsPerVecY, "sanity");
>             assert(lrgmask.is_aligned_sets(RegMask::SlotsPerVecY),
> "vector should be aligned");
>             lrg.set_num_regs(RegMask::SlotsPerVecY);
>             lrg.set_reg_pressure(1);
>
> My understanding is that set_num_regs() is in 32-bit words, like C2 
> uses for register.  However,
> set_reg_pressure() is different, and represents how much "pressure"
> increases for each register
> used, towards the limit of FLOATPRESSURE (unscaled) in this case. This is fine if one vector register fits in one float register, but that's not the case for ARM32, for example. But the LCM changes use a scaled FLOATPRESSURE, so it makes me wonder if chaitin.cpp also needs to do something similar.
>
> To add to my confusion, the fact that set_num_regs() and
> set_reg_pressure() have different units, both for vectors and  other types, like Op_RegP and Op_RegL, makes me wonder if code like the following:
>
>           case MachProjNode::fat_proj:
>             // Fat projections have size equal to number of registers killed
>             lrg.set_num_regs(rm.Size());
>             lrg.set_reg_pressure(lrg.num_regs());
>
> is doing the right thing when it sets both set_num_regs and set_reg_pressure to the same size.  Am I wrong thinking the register pressure units and limits in LCM should be the same as in chaitin.cpp?
>
> dl
>
>
> On 9/15/2015 2:21 PM, Vladimir Kozlov wrote:
>> Scaling is used only to set _high_pressure_limit. See new method 
>> Pressure::init(int limit).
>> Late in lcm _high_pressure_limit is used as threshold to determine if 
>> register pressure should be used scheduling factor:
>>
>> +          // Now evaluate each register pressure component based on
>> threshold in the score.
>> +          // In general the defining register type will dominate the
>> score, ergo we will not see register pressure grow on both banks
>> +          // on a single instruction, but we might see it shrink on
>> both banks.
>> +          if (_regalloc->_sched_int_pressure.current_pressure() >
>> _regalloc->_sched_int_pressure.high_pressure_limit()) {
>> +            short int_pressure = (short)recalc_pressure_nodes[n->_idx];
>> +            n_score = (int_pressure < 0) ? ((score + n_score) -
>> int_pressure) : (int_pressure > 0) ? 1 : n_score;
>> +          }
>> +          if (_regalloc->_sched_float_pressure.current_pressure() >
>> _regalloc->_sched_float_pressure.high_pressure_limit()) {
>> +            short float_pressure =
>> (short)(recalc_pressure_nodes[n->_idx] >> 16);
>> +            n_score = (float_pressure < 0) ? ((score + n_score) -
>> float_pressure) : (float_pressure > 0) ? 1 : n_score;
>> +          }
>>
>> May be comments need to be more clear. Also would be nice to explain 
>> the expression used to calculate n_score here.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/15/15 1:05 PM, Berg, Michael C wrote:
>>> Dean:
>>>
>>> EVEX has a wider vector and more registers.  We go from VecY to VecZ 
>>> and from 16 xmms to 32 xmms.  So I really do mean 2x.
>>> Perhaps it would be cleaner if it returned actual threshold value of 
>>> registers to trigger the algorithm (managed by the .ad file 
>>> function).  Then if a machine arbitrarily changes the number of 
>>> float registers by something other than a scalable multiple (say you 
>>> went from 32 to 48 registers), then we would have a fully accurate 
>>> way to express the notion.  Or if you need to express various 
>>> designs all of which have different numbers of float registers 
>>> available.  This would give us fine granularity of control vs the 
>>> course grain we now have via FLOATPRESSURE which is compile time 
>>> defined.  I like this approach better still, so I will augment it 
>>> and put the code up later today if nobody has objections.
>>>
>>> Thanks,
>>> Michael
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of 
>>> Dean Long
>>> Sent: Tuesday, September 15, 2015 11:26 AM
>>> To: hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>
>>> Michael, can you explain more about Matcher::float_pressure_scale() 
>>> and why it is needed?
>>> I'm trying to understand what the correct value would be for other 
>>> platforms that have vectors, like arm32 and aarch64.
>>>
>>> I think I understand how FLOATPRESSURE is used elsewhere (without 
>>> scaling), but I don't understand why only lcm.cpp needs it scaled, 
>>> and I don't understand exactly what the scaling is doing.
>>>
>>>> The reason I need it is for EVEX enabled uarch machines, which have 
>>>> 2x more xmms.
>>>
>>> I think you mean 2x bits per register, not 2x number of registers.
>>>
>>> dl
>>>
>>> On 9/14/2015 8:46 PM, Berg, Michael C wrote:
>>>> Vladimir, I have made the requested additions from below. I left 
>>>> the object instances not under guard of  the register pressure code 
>>>> as the instances are local scope objects which are passed by 
>>>> reference to the objects I use.  It makes the code more muddy to 
>>>> try to separate the paths  to maintain the scope and use the flag/check.
>>>> Also I retested on the usual suspects (benchmarks,  jtreg, etc) 
>>>> with no issues.  The octane suite shows only a little over 1% 
>>>> increase in
>>>> C2 time with the new algorithm. Please see the updated webrev at 
>>>> this link:
>>>>
>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.03/
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Monday, September 14, 2015 1:09 AM
>>>> To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>
>>>> Yes, I talked about pdf attached to bug report.
>>>>
>>>> Sorry, for some reasons I thought it was reverse - C2 time 
>>>> increased when optimization is on.
>>>> I should have look on positive %. Yes, I agree that 10% c2 time 
>>>> increase for some benchamrks is acceptable if it brings performance 
>>>> improvement.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/14/15 12:46 AM, Berg, Michael C wrote:
>>>>> Ok, for a moment Vladimir I thought you reran the tests and had a 
>>>>> different result.  You are quoting the pdf data.  The data in the 
>>>>> pdf is from the webrev.01, where the register pressure on data was 
>>>>> 3.26s for all of C2 and 5.33s for register pressure off.
>>>>> So that's a sharp decrease in C2 time due to register pressure 
>>>>> scheduling being on, which is caused by saving spill code.
>>>>>
>>>>> -Michael
>>>>>
>>>>> -----Original Message-----
>>>>> From: Berg, Michael C
>>>>> Sent: Sunday, September 13, 2015 11:29 PM
>>>>> To: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: RE: RFR 8134802 - LCM register pressure scheduling
>>>>>
>>>>> Vladimir, I need to know some things about your run. Machine spec, 
>>>>> which compiler x86 or x64, etc.
>>>>> Sure I will run the nashorn metic.  Further guarding the code will 
>>>>> not buy us much in overhead avoidance (as in the suggestion 
>>>>> below), but I will see what  I can do.
>>>>> For now the vector size check will work, but as soon as some other 
>>>>> uarch has a Z vector, we will have to revisit this.
>>>>> The reason I need it is for EVEX enabled uarch machines, which 
>>>>> have 2x more xmms.
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>> Sent: Friday, September 11, 2015 8:58 PM
>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>
>>>>> Looks good.
>>>>>
>>>>> I looked on performance data and for scimark.lu.large C2 time 
>>>>> increase significantly (~ 39%) while score did not improve (0,18%).
>>>>> I can accept compilation time regression if it gives performance 
>>>>> improvement as crypto.aes. But otherwise we need to investigate 
>>>>> why that happens.
>>>>>
>>>>> Can you rerun this on sub-benchmark to see if it repeated?
>>>>>
>>>>> Also, please, do performance run for nashorn as Aleksey suggested.
>>>>>
>>>>> RA code at the beginning of gcm.cpp is not guarded by 
>>>>> OptoRegScheduling.
>>>>> I think you can put guard around all that new code including:
>>>>> _regalloc = ®alloc;
>>>>>
>>>>> Also JPRT reported build failures:
>>>>>
>>>>> hotspot/src/share/vm/opto/lcm.cpp:999:9: error: 'UseAVX' was not 
>>>>> declared in this scope
>>>>>
>>>>>           if (UseAVX > 2) {
>>>>>             float_pressure *= 2;
>>>>>
>>>>> UseAVX is x86 platform-specific.  Why you need to increase 
>>>>> float_pressure? If you really need it you can check:
>>>>>
>>>>>       if (Matcher::max_vector_size(T_DOUBLE) > 4)
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 9/11/15 10:43 AM, Berg, Michael C wrote:
>>>>>> Vladimir, please see the latest update at:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.02/
>>>>>>
>>>>>> I have made the node change from below to share flag definitions 
>>>>>> (reduction/scheduling).
>>>>>> I also added code to screen out methods with only small blocks 
>>>>>> for live range analysis and register pressure scheduling.
>>>>>> For methods which have some larger blocks we now screen out the 
>>>>>> small blocks as well.  Meaning, overhead Is by and large not an 
>>>>>> issue as I see x64 and x86 C2 time not affected by my algorithm 
>>>>>> with any scheduling budget being offset by time not spent 
>>>>>> register allocation.
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>>> Sent: Thursday, September 10, 2015 6:04 PM
>>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>>
>>>>>> On 9/10/15 12:11 PM, Berg, Michael C wrote:
>>>>>>> Ok, I can make is_reduction and is_scheduled have the same value.
>>>>>>> Since I'm clearing it during init processing that will work 
>>>>>>> quite well. Nobody downstream processes reductions.
>>>>>>>
>>>>>>> Problem:
>>>>>>>
>>>>>>> The C++ standard implements enum as int sized, we should union 
>>>>>>> _flags with NodeFlags and increase NodeFlags to juint. We would 
>>>>>>> actually decrease the amount of storage in node by doing so 
>>>>>>> since right now storage for NodeFlags is additive with _flags.  
>>>>>>> We would get 16 more flag slots and make node smaller.
>>>>>> NodeFlags is type, there is no a field in Node class with 
>>>>>> NodeFlags type.  NodeFlags is only used to define flags values 
>>>>>> which are used to set bits in _flags. So I am not sure what you 
>>>>>> are proposing.
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>>>> Sent: Wednesday, September 09, 2015 8:29 PM
>>>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>>>
>>>>>>> We only have 3 bits left since total is 16:
>>>>>>>
>>>>>>> jushort _flags;
>>>>>>>
>>>>>>> You have Flag_is_reduction which is used only in loop 
>>>>>>> opts/superword. So you can overlap these flags.
>>>>>>>
>>>>>>> We need to clean up this (no you, Michael). We have flags which 
>>>>>>> are used only by Ideal node (Flag_is_macro, Flag_is_expensive).
>>>>>>> And flags used by Mach nodes (5 flags). We may try to overlap them.
>>>>>>>
>>>>>>> Vladimir
>>>>>>>
>>>>>>> On 9/9/15 7:34 PM, Berg, Michael C wrote:
>>>>>>>> All, please see the link:
>>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>>>>
>>>>>>>> As I have uploaded a performance report for data collected 
>>>>>>>> with/wo register pressure scheduling. I would like to keep the 
>>>>>>>> node flag in place, we have room for 15 more flags after this 
>>>>>>>> one is added, and this is a formal phase of C2 and so a good 
>>>>>>>> use of one the flags.  The addition of VectorSet would 
>>>>>>>> incrementally raise the overhead of the algorithm. Please have 
>>>>>>>> a look and comment as needed.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>>>>> Sent: Friday, September 04, 2015 6:42 PM
>>>>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>>>>
>>>>>>>> Impressive work. Thank you for reusing current RA functionality.
>>>>>>>>
>>>>>>>> "is very minimal" - how minimal? 2% or 10%?
>>>>>>>>
>>>>>>>> Did it gave any performance improvement? Changes are 
>>>>>>>> significant and should be justified.
>>>>>>>>
>>>>>>>> Changes look reasonable. I only notice one thing:
>>>>>>>> Flag bits in Node is very precious to use for node's state 
>>>>>>>> tracking. Why not use VectorSet?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vladimir
>>>>>>>>
>>>>>>>> On 9/4/15 1:33 PM, Berg, Michael C wrote:
>>>>>>>>> Hi Folks,
>>>>>>>>>
>>>>>>>>> I would like to contribute LCM register pressure scheduling. I 
>>>>>>>>> need two reviewers to examine this patch and comment as needed:
>>>>>>>>>
>>>>>>>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>>>>>
>>>>>>>>> webrev:
>>>>>>>>>
>>>>>>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.01/
>>>>>>>>>
>>>>>>>>> These changes calculate register pressure at the entry of a 
>>>>>>>>> basic block, at the end and incrementally while we are 
>>>>>>>>> scheduling. It uses an efficient algorithm for recalculating 
>>>>>>>>> register pressure on a as needed basis. The algorithm uses 
>>>>>>>>> heuristics to switch to a pressure based algorithm to reduce 
>>>>>>>>> spills for int and float registers using thresholds for each.
>>>>>>>>> It also uses weights which count on a per register class basis 
>>>>>>>>> to dope ready list candidate choice while scheduling so that 
>>>>>>>>> we reduce register pressure when possible. Once we fall over 
>>>>>>>>> either threshold, we start trying mitigate pressure upon the 
>>>>>>>>> affected class of registers which are over the limit. This 
>>>>>>>>> happens on both register classes and/or separately for each. 
>>>>>>>>> We switch back to latency scheduling when pressure is alleviated.
>>>>>>>>> As before we obey hard artifacts such as barriers, fences and such.
>>>>>>>>> Overhead for constructing and providing liveness information 
>>>>>>>>> and the additional algorithmic usage is very minimal, so as 
>>>>>>>>> affect compile time minimally.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Michael
>>>>>>>>>
>>>
>