RFR 8134802 - LCM register pressure scheduling

Mon Sep 14 08:08:58 UTC 2015

Yes, I was talking about the PDF attached to the bug report.

Sorry, for some reason I thought it was the reverse: C2 time increased when the optimization is on.
I should have looked at the positive %. Yes, I agree that a 10% C2 time increase for some benchmarks is acceptable if it brings a
performance improvement.

Thanks,
Vladimir

On 9/14/15 12:46 AM, Berg, Michael C wrote:
> OK, for a moment, Vladimir, I thought you had rerun the tests and got a different result.  You are quoting the PDF data.  The data in the PDF is from webrev.01, where C2 time totaled 3.26s with register pressure scheduling on and 5.33s with it off.
> So that's a sharp decrease in C2 time with register pressure scheduling on, which comes from the spill code it saves.
>
> -Michael
>
> -----Original Message-----
> From: Berg, Michael C
> Sent: Sunday, September 13, 2015 11:29 PM
> To: Vladimir Kozlov; hotspot-compiler-dev at openjdk.java.net
> Subject: RE: RFR 8134802 - LCM register pressure scheduling
>
> Vladimir, I need to know some things about your run: machine spec, which compiler (x86 or x64), etc.
> Sure, I will run the nashorn metric.  Further guarding the code will not buy us much in overhead avoidance (as in the suggestion below), but I will see what I can do.
> For now the vector size check will work, but as soon as some other uarch has a Z vector, we will have to revisit this.
> The reason I need it is for EVEX-enabled uarch machines, which have 2x more xmms.
>
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Friday, September 11, 2015 8:58 PM
> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>
> Looks good.
>
> I looked at the performance data, and for scimark.lu.large C2 time increases significantly (~39%) while the score did not improve (0.18%).
> I can accept a compilation time regression if it gives a performance improvement, as with crypto.aes. But otherwise we need to investigate why that happens.
>
> Can you rerun this sub-benchmark to see if it repeats?
>
> Also, please do a performance run for nashorn as Aleksey suggested.
>
> The RA code at the beginning of gcm.cpp is not guarded by OptoRegScheduling.
> I think you can put a guard around all of that new code, including:
> _regalloc = &regalloc;
>
> Also JPRT reported build failures:
>
> hotspot/src/share/vm/opto/lcm.cpp:999:9: error: 'UseAVX' was not declared in this scope
>
>        if (UseAVX > 2) {
>          float_pressure *= 2;
>
> UseAVX is x86 platform-specific.  Why do you need to increase float_pressure? If you really need it, you can check:
>
>    if (Matcher::max_vector_size(T_DOUBLE) > 4)
>
> Thanks,
> Vladimir
>
> On 9/11/15 10:43 AM, Berg, Michael C wrote:
>> Vladimir, please see the latest update at:
>>
>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.02/
>>
>> I have made the node change from below to share flag definitions (reduction/scheduling).
>> I also added code to screen out methods with only small blocks for live range analysis and register pressure scheduling.
>> For methods which have some larger blocks, we now screen out the small
>> blocks as well.  This means overhead is by and large not an issue: I see x64 and x86 C2 time unaffected by my algorithm, with any scheduling cost offset by time not spent in register allocation.
>>
>> Thanks,
>> Michael
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Thursday, September 10, 2015 6:04 PM
>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>
>> On 9/10/15 12:11 PM, Berg, Michael C wrote:
>>> Ok, I can make is_reduction and is_scheduled have the same value.  Since I'm clearing it during init processing that will work quite well.  Nobody downstream processes reductions.
>>>
>>> Problem:
>>>
>>> The C++ standard implements enum as int sized, so we should union _flags with NodeFlags and increase NodeFlags to juint. We would actually decrease the amount of storage in node by doing so, since right now storage for NodeFlags is additive with _flags.  We would get 16 more flag slots and make node smaller.
>>
>> NodeFlags is a type; there is no field in the Node class with NodeFlags type.  NodeFlags is only used to define flag values, which are used to set bits in _flags. So I am not sure what you are proposing.
>>
>> Thanks,
>> Vladimir
>>
>>>
>>> Michael
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Wednesday, September 09, 2015 8:29 PM
>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>
>>> We only have 3 bits left since total is 16:
>>>
>>> jushort _flags;
>>>
>>> You have Flag_is_reduction which is used only in loop opts/superword. So you can overlap these flags.
>>>
>>> We need to clean this up (not you, Michael). We have flags which are used only by Ideal nodes (Flag_is_macro, Flag_is_expensive), and flags used by Mach nodes (5 flags). We may try to overlap them.
>>>
>>> Vladimir
>>>
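The overlap idea above can be sketched as follows. This is a minimal illustration, not the HotSpot code: the `SketchNode` type, the bit position, and the helper names are assumptions made for the example. Only the 16-bit width of `_flags` and the flag names come from the thread. Two flags can share one bit when the first is only read in an early phase (loop opts/superword) and the second is only written and read later.

```cpp
#include <cstdint>

// Hypothetical stand-in for a node with a 16-bit flag word
// (jushort _flags in the thread).
struct SketchNode {
  // Bit position chosen arbitrarily for this sketch.
  enum Flags : uint16_t {
    Flag_is_reduction = 1u << 12,
    // Same bit, reused by a later phase once reductions are no longer queried.
    Flag_is_scheduled = Flag_is_reduction
  };

  uint16_t _flags = 0;

  void add_flag(uint16_t f)       { _flags |= f; }
  void remove_flag(uint16_t f)    { _flags &= static_cast<uint16_t>(~f); }
  bool has_flag(uint16_t f) const { return (_flags & f) != 0; }
};

// The later phase clears the shared bit before giving it its new meaning,
// so a leftover reduction mark is never misread as "scheduled".
inline void init_for_scheduling(SketchNode& n) {
  n.remove_flag(SketchNode::Flag_is_scheduled);
}
```

The sharing is only safe if nothing after the clearing step still asks whether a node is a reduction, which is the condition stated in the thread.
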
>>> On 9/9/15 7:34 PM, Berg, Michael C wrote:
>>>> All, please see the link:
>>>> https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>
>>>> I have uploaded a performance report for data collected with/without register pressure scheduling. I would like to keep the node flag in place: we have room for 15 more flags after this one is added, and this is a formal phase of C2 and so a good use of one of the flags.  The addition of VectorSet would incrementally raise the overhead of the algorithm. Please have a look and comment as needed.
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Friday, September 04, 2015 6:42 PM
>>>> To: Berg, Michael C; hotspot-compiler-dev at openjdk.java.net
>>>> Subject: Re: RFR 8134802 - LCM register pressure scheduling
>>>>
>>>> Impressive work. Thank you for reusing current RA functionality.
>>>>
>>>> "is very minimal" - how minimal? 2% or 10%?
>>>>
>>>> Did it give any performance improvement? The changes are significant and should be justified.
>>>>
>>>> Changes look reasonable. I only noticed one thing:
>>>> Flag bits in Node are too precious to use for a node's state tracking. Why not use VectorSet?
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 9/4/15 1:33 PM, Berg, Michael C wrote:
>>>>> Hi Folks,
>>>>>
>>>>> I would like to contribute LCM register pressure scheduling. I need
>>>>> two reviewers to examine this patch and comment as needed:
>>>>>
>>>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8134802
>>>>>
>>>>> webrev:
>>>>>
>>>>> http://cr.openjdk.java.net/~mcberg/8134802/webrev.01/
>>>>>
>>>>> These changes calculate register pressure at the entry of a basic
>>>>> block, at the end and incrementally while we are scheduling. It
>>>>> uses an efficient algorithm for recalculating register pressure on
>>>>> an as-needed basis. The algorithm uses heuristics to switch to a
>>>>> pressure-based algorithm to reduce spills for int and float
>>>>> registers using thresholds for each. It also uses weights which
>>>>> count on a per-register-class basis to dope ready list candidate
>>>>> choice while scheduling so that we reduce register pressure when
>>>>> possible. Once we fall over either threshold, we start trying to
>>>>> mitigate pressure upon the affected class of registers which are
>>>>> over the limit. This happens on both register classes and/or
>>>>> separately for each. We switch back to latency scheduling when
>>>>> pressure is alleviated. As before we obey hard artifacts such as barriers, fences and such.
>>>>> Overhead for constructing and providing liveness information and
>>>>> the additional algorithmic usage is very minimal, so as to affect compile time minimally.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Michael
>>>>>
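
The switching behaviour described above can be sketched roughly as follows. This is not the webrev code: the `Candidate` type, the `pick` function, its threshold parameters, and the per-candidate pressure deltas are all hypothetical names chosen for illustration. It shows only the selection step. While both pressures are at or below their thresholds, the choice is made by latency (here simply "largest latency first", which is an assumption of the sketch). Once either class is over its threshold, the choice prefers the candidate that most reduces pressure in the classes that are over.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical ready-list entry: latency plus the net change in live
// int/float registers if this node were scheduled next.
struct Candidate {
  int latency;
  int int_delta;    // negative means int pressure drops
  int float_delta;  // negative means float pressure drops
};

// Returns the index of the chosen candidate, or -1 if the list is empty.
int pick(const std::vector<Candidate>& ready,
         int int_pressure, int float_pressure,
         int int_limit, int float_limit) {
  const bool int_over   = int_pressure   > int_limit;
  const bool float_over = float_pressure > float_limit;
  int best = -1;
  for (std::size_t i = 0; i < ready.size(); ++i) {
    if (best < 0) { best = static_cast<int>(i); continue; }
    const Candidate& c = ready[i];
    const Candidate& b = ready[best];
    if (int_over || float_over) {
      // Pressure mode: only the classes over their limit contribute.
      const int c_cost = (int_over ? c.int_delta : 0) + (float_over ? c.float_delta : 0);
      const int b_cost = (int_over ? b.int_delta : 0) + (float_over ? b.float_delta : 0);
      // Lower cost wins; ties fall back to latency.
      if (c_cost < b_cost || (c_cost == b_cost && c.latency > b.latency)) {
        best = static_cast<int>(i);
      }
    } else if (c.latency > b.latency) {
      // Latency mode: the longest-latency candidate goes first.
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

Because the mode is recomputed from the current pressures on every call, the sketch drops back to latency scheduling as soon as both pressures are at or below their limits again.
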