CR for RFR 8151573

Sat Apr 2 03:18:28 UTC 2016

I start preintegration testing.

Thanks,
Vladimir

On 3/31/16 8:36 PM, Berg, Michael C wrote:
> Vladimir, I think I have addressed every concern in the latest webrev:
>
> http://cr.openjdk.java.net/~mcberg/8151573/webrev.03/
>
> I wound up leaving divergent local copies of the clone code as there are two full separate contexts now in three locations.  Adding more parameters didn't seem to be a win to get around it.
> The code is fully retested with no issues.
>
> Thanks,
> Michael
>
> -----Original Message-----
> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> Sent: Wednesday, March 30, 2016 9:56 PM
> To: Berg, Michael C <michael.c.berg at intel.com>; 'hotspot-compiler-dev at openjdk.java.net' <hotspot-compiler-dev at openjdk.java.net>
> Subject: Re: CR for RFR 8151573
>
> On 3/30/16 4:57 PM, Berg, Michael C wrote:
>> See below for context.
>>
>> Thanks,
>> Michael
>>
>> -----Original Message-----
>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>> Sent: Wednesday, March 30, 2016 3:51 PM
>> To: Berg, Michael C <michael.c.berg at intel.com>;
>> 'hotspot-compiler-dev at openjdk.java.net'
>> <hotspot-compiler-dev at openjdk.java.net>
>> Subject: Re: CR for RFR 8151573
>>
>> Michael,
>>
>> First, please update to latest sources of hs-comp. 8148754 changes modified the same files and it is difficult to apply your changes.
>>
>> multi_version_post_loops() can use is_canonical_main_loop_entry() from
>> 8148754 but you need to modify it to move
>> is_Main() assert to other call sites.
>>
>> <MCB> ok, but should not the name then be is_canonical_loop_entry()? Since it will be for non main loops as well. Also the canonical test is different.  I can leave the name, but it will be overloaded afterward with two types of functionality.  The grape shape test changes between main and post loops. Perhaps better to leave them separate given they are different?  I will leave this one to last so that we have time to discuss this.
>
> I am fine with name change. The only difference I see is that your code checks cur_cmp->in(1) opcode for Opaque1 vs in(2). The question is why you check in(1) - it should be in(2)?
>
>>
>> I did not get rce'd post loop checks in loopnode.cpp.
>>
>> <MCB> First I will have to explain what I am doing with do_range_check().  That code resolves the question of: Do all the range checks have load ranges, only these types of loops are canonical for multiversioning.
>> Basically that means that if RCE succeeds in removing all checks in main, the post loop may still have non-canonical forms that will not be easily handled by subsuming them into the multiversioned loops limit as a min test of all ranges and the limit.  In those cases we really do want to know if we have non-canonical forms, so I thinking leaving the code as is in do_range_check is key to that discovery.  Then insert_scalar_rced_post_loop(...) creates a copy of main which successfully passed range check elimination and is canonical.  Basically this short circuits us from adding any non canonical multiversions, leaving the code in its original form when we do not pass.
>> The copy of main works with the canonical range checks in the final loop to rework the limit of the newly inserted clean post loop so that we never execute a range check index, leaving it for the final loop to do so or for the optimizer to realize that no such case exists in which case the clean loop becomes the one and only copy.  If we cannot multiversion transform the loop we added we eliminate it.
>
> I understand that you want to make clean copy of main loop when it could be vectorized but before it is unrolled.
>
> You should add comment into do_range_check() about what you said (looking only for range checks). Also don't use old_new
>    struct but use simple counter and return it as result (or return bool from do_range_check()) to indicate that only range checks were in loop.
>
> Anyway, I was asked about checks in loopnode.cpp. First, there is some expectation about sequence of post loops. But from what I understand you are looking for a post loops which are not created by insert_scalar_rced_post_loop(). Right?
> There is no comment what it is searching for. There are 3 types of post loop now: atomic, rced and regular (from insert_pre_post_loops()).
> Why not mark rced loop (RCEPostLoop?) to simplify search instead of executing has_range_checks(cl)? What about AtomicPostLoop?
>
>>
>> Swap next checks since has_range_checks() may be expensive scanning loop body:
>> +  // only process RCE'd main loops
>> +  if (cl->has_range_checks() || !cl->is_main_loop()) return;
>>
>> <MCB> Ok, makes sense.
>
> Sorry, I made mistake due to you using the same name for 2 different methods. One is query cl->has_range_checks() which checks flag. And an other has_range_checks(cl) which search loop for If nodes. In such case you can ignore my check swap request.
> But, please, rename has_range_checks(cl) method to avoid confusion.
>
> Thanks,
> Vladimir
>
>>
>> Also if there are no checks in loop you will scan loop's body again since no_checks state is not recorded.
>> I would suggest to scan body unconditionally because code you added to do_range_check() may not precise (you decrement count only for LoadRange when increment for all Ifs). This way you don't need to change old_new around  do_range_check() call.
>>
>> <MCB> I perceive the real problem is don't scan more than once after we check. I will move towards that solution.
>>
>>
>> Why you need local copies?:
>>
>> -  visited.Clear();
>> -  clones.clear();
>> +  Arena *a = Thread::current()->resource_area();
>> +  VectorSet visited(a);
>> +  Node_Stack clones(a, main_head->back_control()->outcnt());
>>
>> <MCB> I will look into this, and see if it can be cleaned up.
>>
>>
>> I don't think PostLoopInfo is needed. insert_post_loop() can return post_head node and and &main_exit could be passed to it to set.
>>
>> <MCB> Ok, I will look into a version without PostLoopInfo.
>>
>> Thanks,
>> Vladimir
>>
>> On 3/30/16 1:44 PM, Berg, Michael C wrote:
>>> Here is an update after full testing, the webrev is:
>>>
>>> http://cr.openjdk.java.net/~mcberg/8151573/webrev.02/
>>>
>>> Please review and comment,
>>>
>>> Thanks,
>>> Michael
>>>
>>> -----Original Message-----
>>> From: hotspot-compiler-dev
>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of
>>> Berg, Michael C
>>> Sent: Wednesday, March 16, 2016 10:30 AM
>>> To: Vladimir Kozlov <vladimir.kozlov at oracle.com>;
>>> 'hotspot-compiler-dev at openjdk.java.net'
>>> <hotspot-compiler-dev at openjdk.java.net>
>>> Subject: RE: CR for RFR 8151573
>>>
>>> Putting a hold on the review, retesting everything on my end.
>>>
>>> -----Original Message-----
>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>> Sent: Wednesday, March 16, 2016 8:42 AM
>>> To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
>>> Subject: Re: CR for RFR 8151573
>>>
>>> On 3/15/16 5:29 PM, Berg, Michael C wrote:
>>>> Vladimir:
>>>>
>>>> The why programmable SIMD depends upon this that all versions of the final post loop have range checks in them until very late, after register allocation, and might be cleaned up in cfg optimizations, but are not always so. With multiversioning, we always remove the range checks in our key loop.
>>>
>>> I understand that we can get some benefits. But in general case they will not be visible.
>>>
>>>>
>>>> With regards to the pre loop, pre loops have special checks too do they not, requiring flow in many cases?
>>>> Programmable SIMD needs tight loops to accurately facilitate masked iteration mapping.
>>>
>>> Yes, after you explained me vector masking I now understand why it could be used for post loop.
>>>
>>> After thinking about this I would suggest for you to look on arraycopy and generate_fill stubs instead in stub_Generator_x86*.cpp (may be only 64-bit). They also have post loops but changes would be only platform specific, smaller and easy to understand and test. Also arraycopy and 'fill' code are used very frequently by Java applications so we may get more benefits than optimizing general loops.
>>>
>>> Regards,
>>> Vladimir
>>>
>>>>
>>>> Regards,
>>>> Michael
>>>>
>>>> -----Original Message-----
>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>> Sent: Tuesday, March 15, 2016 4:37 PM
>>>> To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
>>>> Subject: Re: CR for RFR 8151573
>>>>
>>>> As we all know we can always construct microbenchmarks which shows
>>>> 30%
>>>> - 50% difference. When in real application we will never see
>>>> difference. I still don't see a real reason why we should spend time
>>>> and optimize
>>>> *POST* loops. We already have vectorized post loop to improve performance. Note, additional loop opts code will rise its maintenance cost.
>>>>
>>>> Why "programmable SIMD" depends on it? What about pre-loop?
>>>>
>>>> Thanks,
>>>> Vladimir
>>>>
>>>> On 3/15/16 4:14 PM, Berg, Michael C wrote:
>>>>> Correction below...
>>>>>
>>>>> -----Original Message-----
>>>>> From: hotspot-compiler-dev
>>>>> [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of
>>>>> Berg, Michael C
>>>>> Sent: Tuesday, March 15, 2016 4:08 PM
>>>>> To: Vladimir Kozlov; 'hotspot-compiler-dev at openjdk.java.net'
>>>>> Subject: RE: CR for RFR 8151573
>>>>>
>>>>> Vladimir for programmable SIMD which is the optimization which uses this implementation, I get the following on micros and code in general that look like this:
>>>>>
>>>>>          for(int i = 0; i < process_len; i++)
>>>>>          {
>>>>>            d[i]= (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
>>>>>          }
>>>>>
>>>>> The above code makes 9 vector ops.
>>>>>
>>>>> For float with vector length VecZ, I get as much as 1.3x and for int as much as 1.4x uplift.
>>>>> For double and long on VecZ it is smaller, but then so is the value of vectorization on those types anyways.
>>>>> The value process_len is some fraction of the array length in my measurements.  The idea of the metrics Is to pose a post loop with a modest amount of iterations in it.  For instance N is the max trip of the post loop, and N is 1..VecZ-1 size, then for float we could do as many as 15 iterations in the fixup loop.
>>>>>
>>>>> An example would be array_length = 512, process_len is a range of 81..96, we create a VecZ loop which was superunrolled 4 times with vector length 16, or unroll of 64, we align process 4 iterations, and the vectorized post loop is executed 1 time, leaving the remaining work in the final post loop, in this case possibly a mutilversioned post loop.  We start that final loop at iteration 81 so we always do at least 1 iteration fixup, and as many as 15.  If we left the fixup loop as a scalar loop that would mean 1 to 15 iterations plus our initial loops which have {4,1,1} iterations as a group or 6 to get us to index 80.  By vectorizing the fixup loop to one iteration we now always have 7 iterations in our loops for all ranges of 81..96, without this optimization and programmable SIMD, we would have the initial 6 plush 1 to 15 more, or a range of 7 to 21 iterations.
>>>>>
>>>>> Would you prefer I integrate this with programmable SIMD and submit the patches as one?
>>>>>
>>>>> I thought it would be easier to do them separately.  Also, exposing the post loops to this path offloads cfg processing to earlier compilation, making the graph less complex through register allocation.
>>>>>
>>>>> Regards,
>>>>> Michael
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
>>>>> Sent: Tuesday, March 15, 2016 2:42 PM
>>>>> To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
>>>>> Subject: Re: CR for RFR 8151573
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Changes are significant so they have to be justified. Especially since we are in later stage of jdk9 development. Do you have performance numbers (not only for microbenchmarhks) which show the benefit of these changes?
>>>>>
>>>>> Thanks,
>>>>> Vladimir
>>>>>
>>>>> On 3/15/16 2:04 PM, Berg, Michael C wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> I would like to contribute multi-versioning post loops for range
>>>>>> check elimination.  Beforehand cfg optimizations after register
>>>>>> allocation were where post loop optimizations were done for range
>>>>>> checks.  I have added code which produces the desired effect much
>>>>>> earlier by introducing a safe transformation which will minimally
>>>>>> allow a range check free version of the final post loop to execute
>>>>>> up until the point it actually has to take a range check exception
>>>>>> by re-ranging the limit of the rce'd loop, then exit the rce'd
>>>>>> post loop and take the range check exception in the legacy loops execution if required.
>>>>>> If during optimization we discover that we know enough to remove
>>>>>> the range check version of the post loop, mostly by exposing the
>>>>>> load range values into the limit logic of the rce'd post loop, we
>>>>>> will eliminate the range check post loop altogether much like cfg
>>>>>> optimizations did, but much earlier.  This gives optimizations
>>>>>> like programmable SIMD (via SuperWord) the opportunity to
>>>>>> vectorize the rce'd post loops to a single iteration based on mask
>>>>>> vectors which map to the residual iterations. Programmable SIMD
>>>>>> will be a follow on change set utilizing this code to stage its
>>>>>> work. This optimization also exposes the rce'd post loop without flow to other optimizations.
>>>>>> Currently I have enabled this optimization for x86 only.  We base
>>>>>> this loop on successfully rce'd main loops and if for whatever reason, multiversioning fails, we eliminate the loop we added.
>>>>>>
>>>>>> This code was tested as follows:
>>>>>>
>>>>>>
>>>>>> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8151573
>>>>>>
>>>>>>
>>>>>> webrev:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~mcberg/8151573/webrev.01/
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Michael
>>>>>>