RFR 8134802 - LCM register pressure scheduling
Dean Long
dean.long at oracle.com
Tue Sep 15 21:22:55 UTC 2015
Thanks Michael, that sounds promising.
I'm still fuzzy on how the LCM changes relate to the register pressure
settings done in PhaseChaitin::gather_lrg_masks(), so I may have more
questions after I see your new webrev. If new Matcher functions like
Matcher::float_pressure_scale() can be used to get rid of the
platform-specific ifdefs in gather_lrg_masks() at the same time, that
would be great.
dl
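
[Editorial sketch, not part of the original message.] The two options discussed in this thread, an integer scale applied to a compile-time constant versus a platform hook that returns the absolute threshold, can be contrasted as below. All names and numbers are hypothetical: `kBaseFloatPressure`, `float_threshold_scaled`, `float_threshold_absolute`, and the 3/8 ratio are illustration only and are not taken from the webrev or from any .ad file.

```cpp
#include <cassert>

// Hypothetical compile-time constant standing in for FLOATPRESSURE.
constexpr int kBaseFloatPressure = 6;

// Option A: scale the compile-time constant by an integer factor.
// Coarse-grained: only whole multiples of the base are expressible.
inline int float_threshold_scaled(int scale) {
  assert(scale >= 1);
  return kBaseFloatPressure * scale;
}

// Option B: a platform hook returns the threshold directly.
// ASSUMPTION for illustration: the threshold tracks the float register
// count at a fixed 3/8 ratio. A real platform would pick its own policy.
inline int float_threshold_absolute(int num_float_regs) {
  assert(num_float_regs >= 0);
  return num_float_regs * 3 / 8;
}
```

Under these assumed numbers, 16 and 32 registers give the same answers as scales of 1 and 2, while a 48-register design gets its own value without needing a fractional scale relative to the 32-register setting.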
On 9/15/2015 1:05 PM, Berg, Michael C wrote:
> Dean:
>
> EVEX has a wider vector and more registers. We go from VecY to VecZ and from 16 xmms to 32 xmms. So I really do mean 2x.
> Perhaps it would be cleaner if it returned the actual threshold value of registers that triggers the algorithm (managed by the .ad file function). Then if a machine changes the number of float registers by something other than a scalable multiple (say you went from 32 to 48 registers), we would have a fully accurate way to express that. The same applies if you need to express various designs, each with a different number of float registers available. This would give us fine-grained control versus the coarse-grained control we now have via FLOATPRESSURE, which is defined at compile time. I like this approach better still, so I will augment it and put the code up later today if nobody has objections.
>
> Thanks,
> Michael
>
> -----Original Message-----
> From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Dean Long
> Sent: Tuesday, September 15, 2015 11:26 AM
> To: hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>
> Michael, can you explain more about Matcher::float_pressure_scale() and why it is needed?
> I'm trying to understand what the correct value would be for other platforms that have vectors, like arm32 and aarch64.
>
> I think I understand how FLOATPRESSURE is used elsewhere (without scaling), but I don't understand why only lcm.cpp needs it scaled, and I don't understand exactly what the scaling is doing.
>
>> The reason I need it is for EVEX enabled uarch machines, which have 2x more xmms.
> I think you mean 2x bits per register, not 2x number of registers.
>
> dl
>
> On 9/14/2015 8:46 PM, Berg, Michael C wrote:
>> Vladimir, I have made the requested additions from below. I left the object instances not under guard of the register pressure code as the instances are local scope objects which are passed by reference to the objects I use. It makes the code more muddy to try to separate the paths to maintain the scope and use the flag/check. Also I retested on the usual suspects (benchmarks, jtreg, etc) with no issues. The octane suite shows only a little over 1% increase in C2 time with the new algorithm. Please see the updated webrev at this link:
>>
>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.03/
>>
>> Thanks,
>> Michael
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Monday, September 14, 2015 1:09 AM
>> To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>
>> Yes, I talked about pdf attached to bug report.
>>
>> Sorry, for some reason I thought it was the reverse - C2 time increased when the optimization is on.
>> I should have looked at the positive %. Yes, I agree that a 10% C2 time increase for some benchmarks is acceptable if it brings a performance improvement.
>>
>> Thanks,
>> Vladimir
>>
>> On 9/14/15 12:46 AM, Berg, Michael C wrote:
>>> Ok, for a moment Vladimir I thought you reran the tests and had a different result. You are quoting the pdf data. The data in the pdf is from webrev.01, where total C2 time was 3.26s with register pressure scheduling on and 5.33s with it off.
>>> So that's a sharp decrease in C2 time due to register pressure scheduling being on, which is caused by saving spill code.
>>>
>>> -Michael
>>>
>>> -----Original Message-----
>>> From: Berg, Michael C
>>> Sent: Sunday, September 13, 2015 11:29 PM
>>> To: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net
>>> Subject: RE: RFR 8134802 - LCM register pressure scheduling
>>>
>>> Vladimir, I need to know some things about your run. Machine spec, which compiler x86 or x64, etc.
>>> Sure, I will run the nashorn metric. Further guarding the code will not buy us much in overhead avoidance (as in the suggestion below), but I will see what I can do.
>>> For now the vector size check will work, but as soon as some other uarch has a Z vector, we will have to revisit this.
>>> The reason I need it is for EVEX enabled uarch machines, which have 2x more xmms.
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Friday, September 11, 2015 8:58 PM
>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>
>>> Looks good.
>>>
>>> I looked at the performance data, and for scimark.lu.large C2 time increased significantly (~39%) while the score did not improve (0.18%).
>>> I can accept a compilation time regression if it gives a performance improvement, as with crypto.aes. But otherwise we need to investigate why that happens.
>>>
>>> Can you rerun this sub-benchmark to see if it repeats?
>>>
>>> Also, please, do performance run for nashorn as Aleksey suggested.
>>>
>>> RA code at the beginning of gcm.cpp is not guarded by OptoRegScheduling.
>>> I think you can put guard around all that new code including:
>>> _regalloc = &regalloc;
>>>
>>> Also JPRT reported build failures:
>>>
>>> hotspot/src/share/vm/opto/lcm.cpp:999:9: error: 'UseAVX' was not
>>> declared in this scope
>>>
>>> if (UseAVX > 2) {
>>> float_pressure *= 2;
>>>
>>> UseAVX is x86 platform-specific. Why do you need to increase float_pressure? If you really need it you can check:
>>>
>>> if (Matcher::max_vector_size(T_DOUBLE) > 4)
>>>
>>> Thanks,
>>> Vladimir
>>>
>>> On 9/11/15 10:43 AM, Berg, Michael C wrote:
>>>> Vladimir, please see the latest update at:
>>>>
>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.02/
>>>>
>>>> I have made the node change from below to share flag definitions (reduction/scheduling).
>>>> I also added code to screen out methods with only small blocks for live range analysis and register pressure scheduling.
>>>> For methods which have some larger blocks we now screen out the
>>>> small blocks as well. As a result, overhead is by and large not an issue: I see x64 and x86 C2 time unaffected by my algorithm, with any scheduling budget offset by time not spent in register allocation.
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Thursday, September 10, 2015 6:04 PM
>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>
>>>> On 9/10/15 12:11 PM, Berg, Michael C wrote:
>>>>> Ok, I can make is_reduction and is_scheduled have the same value. Since I'm clearing it during init processing that will work quite well. Nobody downstream processes reductions.
>>>>>
>>>>> Problem:
>>>>>
>>>>> The C++ standard implements enum as int sized, we should union _flags with NodeFlags and increase NodeFlags to juint. We would actually decrease the amount of storage in node by doing so since right now storage for NodeFlags is additive with _flags. We would get 16 more flag slots and make node smaller.
>>>> NodeFlags is a type; there is no field in the Node class with NodeFlags type. NodeFlags is only used to define flag values which are used to set bits in _flags. So I am not sure what you are proposing.
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>>> Michael
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>> Sent: Wednesday, September 09, 2015 8:29 PM
>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>
>>>>> We only have 3 bits left since total is 16:
>>>>>
>>>>> jushort _flags;
>>>>>
>>>>> You have Flag_is_reduction which is used only in loop opts/superword. So you can overlap these flags.
>>>>>
>>>>> We need to clean this up (not you, Michael). We have flags which are used only by Ideal nodes (Flag_is_macro, Flag_is_expensive), and flags used by Mach nodes (5 flags). We may try to overlap them.
>>>>>
>>>>> Vladimir
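
[Editorial sketch, not part of the original message.] The overlap suggested above can be pictured as two flag names sharing one bit of a 16-bit field. This is a minimal illustration under stated assumptions: `NodeSketch`, `jushort_t`, the bit position, and the name `Flag_is_scheduled` are hypothetical and do not reproduce the real Node class. Sharing is only safe when the two uses never overlap in time, which in this thread relies on the bit being cleared before scheduling starts.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint16_t jushort_t;  // stand-in for HotSpot's 16-bit jushort

enum NodeFlagsSketch {
  Flag_is_reduction = 1 << 12,           // hypothetical bit position
  Flag_is_scheduled = Flag_is_reduction  // deliberately shares the same bit
};

struct NodeSketch {
  jushort_t _flags = 0;

  void add_flag(jushort_t f) {
    assert(f != 0 && "flag mask must have a bit set");
    _flags = static_cast<jushort_t>(_flags | f);
  }
  void remove_flag(jushort_t f) {
    _flags = static_cast<jushort_t>(_flags & ~f);
  }
  bool has_flag(jushort_t f) const { return (_flags & f) != 0; }
};
```

Because both names map to one bit, setting either makes both queries true, so the earlier phase has to clear the bit before the later phase gives it a new meaning.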
>>>>>
>>>>> On 9/9/15 7:34 PM, Berg, Michael C wrote:
>>>>>> All, please see the link:
>>>>>> https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>>
>>>>>> I have uploaded a performance report there for data collected with and without register pressure scheduling. I would like to keep the node flag in place: we have room for 15 more flags after this one is added, and this is a formal phase of C2 and so a good use of one of the flags. The addition of VectorSet would incrementally raise the overhead of the algorithm. Please have a look and comment as needed.
>>>>>>
>>>>>> Thanks,
>>>>>> Michael
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>>> Sent: Friday, September 04, 2015 6:42 PM
>>>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>>>
>>>>>> Impressive work. Thank you for reusing current RA functionality.
>>>>>>
>>>>>> "is very minimal" - how minimal? 2% or 10%?
>>>>>>
>>>>>> Did it give any performance improvement? Changes are significant and should be justified.
>>>>>>
>>>>>> Changes look reasonable. I only notice one thing:
>>>>>> Flag bits in Node are too precious to use for tracking a node's state. Why not use VectorSet?
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir
>>>>>>
>>>>>> On 9/4/15 1:33 PM, Berg, Michael C wrote:
>>>>>>> Hi Folks,
>>>>>>>
>>>>>>> I would like to contribute LCM register pressure scheduling. I
>>>>>>> need two reviewers to examine this patch and comment as needed:
>>>>>>>
>>>>>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>>>
>>>>>>> webrev:
>>>>>>>
>>>>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.01/
>>>>>>>
>>>>>>> These changes calculate register pressure at the entry of a basic
>>>>>>> block, at the end and incrementally while we are scheduling. It
>>>>>>> uses an efficient algorithm for recalculating register pressure
>>>>>>> on an as-needed basis. The algorithm uses heuristics to switch to
>>>>>>> a pressure based algorithm to reduce spills for int and float
>>>>>>> registers using thresholds for each. It also uses weights which
>>>>>>> count on a per register class basis to dope ready list candidate
>>>>>>> choice while scheduling so that we reduce register pressure when
>>>>>>> possible. Once we fall over either threshold, we start trying to
>>>>>>> mitigate pressure upon the affected class of registers which are
>>>>>>> over the limit. This happens on both register classes and/or
>>>>>>> separately for each. We switch back to latency scheduling when
>>>>>>> pressure is alleviated. As before we obey hard artifacts such as barriers, fences and such.
>>>>>>> Overhead for constructing and providing liveness information and
>>>>>>> the additional algorithmic usage is very minimal, so as to affect compile time minimally.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Michael
>>>>>>>
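
[Editorial sketch, not part of the original message.] The switching behaviour described in the original post (latency scheduling by default, pressure-aware choice for whichever register class is over its threshold, and incremental pressure updates) might look roughly like this. It is not the code in the webrev: `Candidate`, `Pressure`, `Limits`, `pick_next`, and `apply` are hypothetical names, and the scoring rule is an assumed simplification of the weighting the post mentions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Candidate {
  int latency;      // higher means more urgent under latency scheduling
  int int_delta;    // net change in live int registers if scheduled
  int float_delta;  // net change in live float registers if scheduled
};

struct Pressure { int int_regs; int float_regs; };
struct Limits   { int int_limit; int float_limit; };

// Cost counts only the register classes that are currently over their limit.
static int pressure_cost(const Candidate& c, bool int_over, bool float_over) {
  int cost = 0;
  if (int_over)   cost += c.int_delta;
  if (float_over) cost += c.float_delta;
  return cost;
}

// Pick from a non-empty ready list. With no class over its limit every cost
// is 0, so latency alone decides; otherwise lower pressure cost wins and
// latency only breaks ties.
std::size_t pick_next(const std::vector<Candidate>& ready,
                      const Pressure& cur, const Limits& lim) {
  assert(!ready.empty());
  const bool int_over   = cur.int_regs   > lim.int_limit;
  const bool float_over = cur.float_regs > lim.float_limit;
  std::size_t best = 0;
  for (std::size_t i = 1; i < ready.size(); ++i) {
    const int ci = pressure_cost(ready[i], int_over, float_over);
    const int cb = pressure_cost(ready[best], int_over, float_over);
    if (ci < cb || (ci == cb && ready[i].latency > ready[best].latency)) {
      best = i;
    }
  }
  return best;
}

// Incremental update: adjust pressure by the scheduled candidate's deltas
// instead of recomputing liveness for the whole block.
void apply(Pressure& cur, const Candidate& c) {
  cur.int_regs   += c.int_delta;
  cur.float_regs += c.float_delta;
}
```

Once `apply` brings a class back under its limit, the next call to `pick_next` falls back to pure latency order for that class, mirroring the switch back to latency scheduling described above.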