RFR 8134802 - LCM register pressure scheduling

Tue Sep 15 22:01:42 UTC 2015

Vladimir, my concerns about LCM are mostly because of my understanding 
of the register
pressure code in chaitin.cpp.  For example:

         case Op_VecY:
assert(Matcher::vector_size_supported(T_FLOAT,RegMask::SlotsPerVecY), 
"sanity");
           assert(RegMask::num_registers(Op_VecY) == 
RegMask::SlotsPerVecY, "sanity");
           assert(lrgmask.is_aligned_sets(RegMask::SlotsPerVecY), 
"vector should be aligned");
           lrg.set_num_regs(RegMask::SlotsPerVecY);
           lrg.set_reg_pressure(1);

My understanding is that set_num_regs() is in 32-bit words, like C2 uses 
for register.  However,
set_reg_pressure() is different, and represents how much "pressure" 
increases for each register
used, towards the limit of FLOATPRESSURE (unscaled) in this case. This 
is fine if one vector register fits in
one float register, but that's not the case for ARM32, for example. But 
the LCM changes use a
scaled FLOATPRESSURE, so it makes me wonder if chaitin.cpp also needs to 
do something similar.

To add to my confusion, the fact that set_num_regs() and 
set_reg_pressure() have different units,
both for vectors and  other types, like Op_RegP and Op_RegL, makes me 
wonder if code like the following:

         case MachProjNode::fat_proj:
           // Fat projections have size equal to number of registers killed
           lrg.set_num_regs(rm.Size());
           lrg.set_reg_pressure(lrg.num_regs());

is doing the right thing when it sets both set_num_regs and 
set_reg_pressure to the same size.  Am I wrong
thinking the register pressure units and limits in LCM should be the 
same as in chaitin.cpp?

dl

On 9/15/2015 2:21 PM, Vladimir Kozlov wrote:
> Scaling is used only to set _high_pressure_limit. See new method 
> Pressure::init(int limit).
> Late in lcm _high_pressure_limit is used as threshold to determine if 
> register pressure should be used scheduling factor:
>
> +          // Now evaluate each register pressure component based on 
> threshold in the score.
> +          // In general the defining register type will dominate the 
> score, ergo we will not see register pressure grow on both banks
> +          // on a single instruction, but we might see it shrink on 
> both banks.
> +          if (_regalloc->_sched_int_pressure.current_pressure() > 
> _regalloc->_sched_int_pressure.high_pressure_limit()) {
> +            short int_pressure = (short)recalc_pressure_nodes[n->_idx];
> +            n_score = (int_pressure < 0) ? ((score + n_score) - 
> int_pressure) : (int_pressure > 0) ? 1 : n_score;
> +          }
> +          if (_regalloc->_sched_float_pressure.current_pressure() > 
> _regalloc->_sched_float_pressure.high_pressure_limit()) {
> +            short float_pressure = 
> (short)(recalc_pressure_nodes[n->_idx] >> 16);
> +            n_score = (float_pressure < 0) ? ((score + n_score) - 
> float_pressure) : (float_pressure > 0) ? 1 : n_score;
> +          }
>
> May be comments need to be more clear. Also would be nice to explain 
> the expression used to calculate n_score here.
>
> Thanks,
> Vladimir
>
> On 9/15/15 1:05 PM, Berg, Michael C wrote:
>> Dean:
>>
>> EVEX has a wider vector and more registers.  We go from VecY to VecZ 
>> and from 16 xmms to 32 xmms.  So I really do mean 2x.
>> Perhaps it would be cleaner if it returned actual threshold value of 
>> registers to trigger the algorithm (managed by the .ad file 
>> function).  Then if a machine arbitrarily changes the number of float 
>> registers by something other than a scalable multiple (say you went 
>> from 32 to 48 registers), then we would have a fully accurate way to 
>> express the notion.  Or if you need to express various designs all of 
>> which have different numbers of float registers available.  This 
>> would give us fine granularity of control vs the course grain we now 
>> have via FLOATPRESSURE which is compile time defined.  I like this 
>> approach better still, so I will augment it and put the code up later 
>> today if nobody has objections.
>>
>> Thanks,
>> Michael
>>
>> -----Original Message-----
>> From: hotspot-compiler-dev 
>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of 
>> Dean Long
>> Sent: Tuesday, September 15, 2015 11:26 AM
>> To: hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>
>> Michael, can you explain more about Matcher::float_pressure_scale() 
>> and why it is needed?
>> I'm trying to understand what the correct value would be for other 
>> platforms that have vectors, like arm32 and aarch64.
>>
>> I think I understand how FLOATPRESSURE is used elsewhere (without 
>> scaling), but I don't understand why only lcm.cpp needs it scaled, 
>> and I don't understand exactly what the scaling is doing.
>>
>>> The reason I need it is for EVEX enabled uarch machines, which have 
>>> 2x more xmms.
>>
>> I think you mean 2x bits per register, not 2x number of registers.
>>
>> dl
>>
>> On 9/14/2015 8:46 PM, Berg, Michael C wrote:
>>> Vladimir, I have made the requested additions from below. I left the 
>>> object instances not under guard of  the register pressure code as 
>>> the instances are local scope objects which are passed by reference 
>>> to the objects I use.  It makes the code more muddy to try to 
>>> separate the paths  to maintain the scope and use the flag/check.  
>>> Also I retested on the usual suspects (benchmarks,  jtreg, etc) with 
>>> no issues.  The octane suite shows only a little over 1% increase in 
>>> C2 time with the new algorithm. Please see the updated webrev at 
>>> this link:
>>>
>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.03/
>>>
>>> Thanks,
>>> Michael
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Monday, September 14, 2015 1:09 AM
>>> To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>
>>> Yes, I talked about pdf attached to bug report.
>>>
>>> Sorry, for some reasons I thought it was reverse - C2 time increased 
>>> when optimization is on.
>>> I should have look on positive %. Yes, I agree that 10% c2 time 
>>> increase for some benchamrks is acceptable if it brings performance 
>>> improvement.
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/14/15 12:46 AM, Berg, Michael C wrote:
>>>> Ok, for a moment Vladimir I thought you reran the tests and had a 
>>>> different result.  You are quoting the pdf data.  The data in the 
>>>> pdf is from the webrev.01, where the register pressure on data was 
>>>> 3.26s for all of C2 and 5.33s for register pressure off.
>>>> So that's a sharp decrease in C2 time due to register pressure 
>>>> scheduling being on, which is caused by saving spill code.
>>>>
>>>> -Michael
>>>>
>>>> -----Original Message-----
>>>> From: Berg, Michael C
>>>> Sent: Sunday, September 13, 2015 11:29 PM
>>>> To: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: RE: RFR 8134802 - LCM register pressure scheduling
>>>>
>>>> Vladimir, I need to know some things about your run. Machine spec, 
>>>> which compiler x86 or x64, etc.
>>>> Sure I will run the nashorn metic.  Further guarding the code will 
>>>> not buy us much in overhead avoidance (as in the suggestion below), 
>>>> but I will see what  I can do.
>>>> For now the vector size check will work, but as soon as some other 
>>>> uarch has a Z vector, we will have to revisit this.
>>>> The reason I need it is for EVEX enabled uarch machines, which have 
>>>> 2x more xmms.
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Friday, September 11, 2015 8:58 PM
>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>
>>>> Looks good.
>>>>
>>>> I looked on performance data and for scimark.lu.large C2 time 
>>>> increase significantly (~ 39%) while score did not improve (0,18%).
>>>> I can accept compilation time regression if it gives performance 
>>>> improvement as crypto.aes. But otherwise we need to investigate why 
>>>> that happens.
>>>>
>>>> Can you rerun this on sub-benchmark to see if it repeated?
>>>>
>>>> Also, please, do performance run for nashorn as Aleksey suggested.
>>>>
>>>> RA code at the beginning of gcm.cpp is not guarded by 
>>>> OptoRegScheduling.
>>>> I think you can put guard around all that new code including:
>>>> _regalloc = ®alloc;
>>>>
>>>> Also JPRT reported build failures:
>>>>
>>>> hotspot/src/share/vm/opto/lcm.cpp:999:9: error: 'UseAVX' was not
>>>> declared in this scope
>>>>
>>>>          if (UseAVX > 2) {
>>>>            float_pressure *= 2;
>>>>
>>>> UseAVX is x86 platform-specific.  Why you need to increase 
>>>> float_pressure? If you really need it you can check:
>>>>
>>>>      if (Matcher::max_vector_size(T_DOUBLE) > 4)
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/11/15 10:43 AM, Berg, Michael C wrote:
>>>>> Vladimir, please see the latest update at:
>>>>>
>>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.02/
>>>>>
>>>>> I have made the node change from below to share flag definitions 
>>>>> (reduction/scheduling).
>>>>> I also added code to screen out methods with only small blocks for 
>>>>> live range analysis and register pressure scheduling.
>>>>> For methods which have some larger blocks we now screen out the
>>>>> small blocks as well.  Meaning, overhead Is by and large not an 
>>>>> issue as I see x64 and x86 C2 time not affected by my algorithm 
>>>>> with any scheduling budget being offset by time not spent register 
>>>>> allocation.
>>>>>
>>>>> Thanks,
>>>>> Michael
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>> Sent: Thursday, September 10, 2015 6:04 PM
>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>
>>>>> On 9/10/15 12:11 PM, Berg, Michael C wrote:
>>>>>> Ok, I can make is_reduction and is_scheduled have the same 
>>>>>> value.  Since I'm clearing it during init processing that will 
>>>>>> work quite well. Nobody downstream processes reductions.
>>>>>>
>>>>>> Problem:
>>>>>>
>>>>>> The C++ standard implements enum as int sized, we should union 
>>>>>> _flags with NodeFlags and increase NodeFlags to juint. We would 
>>>>>> actually decrease the amount of storage in node by doing so since 
>>>>>> right now storage for NodeFlags is additive with _flags.  We 
>>>>>> would get 16 more flag slots and make node smaller.
>>>>> NodeFlags is type, there is no a field in Node class with 
>>>>> NodeFlags type.  NodeFlags is only used to define flags values 
>>>>> which are used to set bits in _flags. So I am not sure what you 
>>>>> are proposing.
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>>> Michael
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>>> Sent: Wednesday, September 09, 2015 8:29 PM
>>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>>
>>>>>> We only have 3 bits left since total is 16:
>>>>>>
>>>>>> jushort _flags;
>>>>>>
>>>>>> You have Flag_is_reduction which is used only in loop 
>>>>>> opts/superword. So you can overlap these flags.
>>>>>>
>>>>>> We need to clean up this (no you, Michael). We have flags which 
>>>>>> are used only by Ideal node (Flag_is_macro, Flag_is_expensive). 
>>>>>> And flags used by Mach nodes (5 flags). We may try to overlap them.
>>>>>>
>>>>>> Vladimir
>>>>>>
>>>>>> On 9/9/15 7:34 PM, Berg, Michael C wrote:
>>>>>>> All, please see the link:
>>>>>>> https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>>>
>>>>>>> As I have uploaded a performance report for data collected 
>>>>>>> with/wo register pressure scheduling. I would like to keep the 
>>>>>>> node flag in place, we have room for 15 more flags after this 
>>>>>>> one is added, and this is a formal phase of C2 and so a good use 
>>>>>>> of one the flags.  The addition of VectorSet would incrementally 
>>>>>>> raise the overhead of the algorithm. Please have a look and 
>>>>>>> comment as needed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>>>> Sent: Friday, September 04, 2015 6:42 PM
>>>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>>>
>>>>>>> Impressive work. Thank you for reusing current RA functionality.
>>>>>>>
>>>>>>> "is very minimal" - how minimal? 2% or 10%?
>>>>>>>
>>>>>>> Did it gave any performance improvement? Changes are significant 
>>>>>>> and should be justified.
>>>>>>>
>>>>>>> Changes look reasonable. I only notice one thing:
>>>>>>> Flag bits in Node is very precious to use for node's state 
>>>>>>> tracking. Why not use VectorSet?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Vladimir
>>>>>>>
>>>>>>> On 9/4/15 1:33 PM, Berg, Michael C wrote:
>>>>>>>> Hi Folks,
>>>>>>>>
>>>>>>>> I would like to contribute LCM register pressure scheduling. I
>>>>>>>> need two reviewers to examine this patch and comment as needed:
>>>>>>>>
>>>>>>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>>>>
>>>>>>>> webrev:
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.01/
>>>>>>>>
>>>>>>>> These changes calculate register pressure at the entry of a basic
>>>>>>>> block, at the end and incrementally while we are scheduling. It
>>>>>>>> uses an efficient algorithm for recalculating register pressure
>>>>>>>> on a as needed basis. The algorithm uses heuristics to switch to
>>>>>>>> a pressure based algorithm to reduce spills for int and float
>>>>>>>> registers using thresholds for each. It also uses weights which
>>>>>>>> count on a per register class basis to dope ready list candidate
>>>>>>>> choice while scheduling so that we reduce register pressure when
>>>>>>>> possible. Once we fall over either threshold, we start trying
>>>>>>>> mitigate pressure upon the affected class of registers which are
>>>>>>>> over the limit. This happens on both register classes and/or
>>>>>>>> separately for each. We switch back to latency scheduling when
>>>>>>>> pressure is alleviated. As before we obey hard artifacts such 
>>>>>>>> as barriers, fences and such.
>>>>>>>> Overhead for constructing and providing liveness information and
>>>>>>>> the additional algorithmic usage is very minimal, so as affect 
>>>>>>>> compile time minimally.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Michael
>>>>>>>>
>>