CR for RFR 8151573

Tue Mar 15 23:14:39 UTC 2016

Correction below...

-----Original Message-----
From: hotspot-compiler-dev [mailto:hotspot-compiler-dev-bounces at openjdk.java.net] On Behalf Of Berg, Michael C
Sent: Tuesday, March 15, 2016 4:08 PM
To: Vladimir Kozlov; 'hotspot-compiler-dev at openjdk.java.net'
Subject: RE: CR for RFR 8151573

Vladimir, for programmable SIMD, which is the optimization that uses this implementation, I get the following on micros and on code in general that looks like this:

    for(int i = 0; i < process_len; i++)
    {
      d[i]= (a[i] * b[i]) + (a[i] * c[i]) + (b[i] * c[i]);
    }

The above code makes 9 vector ops.
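
For reference, a self-contained version of that kernel might look like the sketch below. The class name, the method name and the choice of float arrays are assumptions made for illustration. The comments give one plausible accounting of the 9 operations (3 loads, 3 multiplies, 2 adds, 1 store); the compiler's actual vector node count may be derived differently.

```java
final class ProductSumKernel {
    // Hypothetical wrapper around the loop from the email; float is assumed here.
    static void kernel(float[] d, float[] a, float[] b, float[] c, int processLen) {
        for (int i = 0; i < processLen; i++) {
            float va = a[i];                              // load 1
            float vb = b[i];                              // load 2
            float vc = c[i];                              // load 3
            // 3 multiplies, 2 adds, then 1 store into d[i]
            d[i] = (va * vb) + (va * vc) + (vb * vc);
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f};
        float[] b = {3f, 4f};
        float[] c = {5f, 6f};
        float[] d = new float[2];
        kernel(d, a, b, c, 2);
        System.out.println(d[0] + " " + d[1]);
    }
}
```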

For float with vector length VecZ, I get as much as 1.3x and for int as much as 1.4x uplift.
For double and long on VecZ it is smaller, but then so is the value of vectorization on those types anyway.
The value process_len is some fraction of the array length in my measurements.  The idea of the metrics is to present a post loop with a modest number of iterations in it.  For instance, if N is the max trip count of the post loop and N is in the range 1..VecZ-1, then for float we could do as many as 15 iterations in the fixup loop.

An example would be array_length = 512 with process_len in the range 81..96.  We create a VecZ loop which was superunrolled 4 times with vector length 16, for an unroll of 64; the alignment loop processes 4 iterations; and the vectorized post loop is executed 1 time, leaving the remaining work in the final post loop, in this case possibly a multiversioned post loop.  We start that final loop at iteration 81, so we always do at least 1 iteration of fixup and as many as 15.  If we left the fixup loop as a scalar loop, that would mean 1 to 15 iterations plus our initial loops, which have {4,1,1} iterations as a group, or 6, to get us to index 80.  By vectorizing the fixup loop to one iteration we now always have 7 iterations in our loops for all ranges of 81..96.  Without this optimization and programmable SIMD, we would have the initial 6 plus 1 to 15 more, or a range of 7 to 21 iterations.
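
The iteration counts in that example can be tallied with a small sketch. The class and method names are made up for illustration; the constants {4,1,1} and the 1..15 residual range are taken directly from the numbers above, not measured.

```java
final class PostLoopIterations {
    // Iterations of the initial loops as given in the email: {4,1,1},
    // i.e. alignment loop, main loop, vectorized post loop.
    static final int INITIAL_ITERATIONS = 4 + 1 + 1;

    // Scalar fixup loop: one iteration per residual element.
    static int scalarTotal(int residual) {
        return INITIAL_ITERATIONS + residual;
    }

    // Masked (programmable SIMD) fixup loop: one iteration covers all residual elements.
    static int maskedTotal(int residual) {
        return INITIAL_ITERATIONS + (residual > 0 ? 1 : 0);
    }

    public static void main(String[] args) {
        for (int residual = 1; residual <= 15; residual++) {
            System.out.println(residual + " residual: scalar=" + scalarTotal(residual)
                    + " masked=" + maskedTotal(residual));
        }
    }
}
```

With a residual of 1 to 15 elements, the scalar fixup totals 7 to 21 iterations, while the masked fixup stays at 7.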

Would you prefer I integrate this with programmable SIMD and submit the patches as one?

I thought it would be easier to do them separately.  Also, exposing the post loops to this path offloads cfg processing to earlier compilation, making the graph less complex through register allocation.

Regards,
Michael

-----Original Message-----
From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
Sent: Tuesday, March 15, 2016 2:42 PM
To: Berg, Michael C; 'hotspot-compiler-dev at openjdk.java.net'
Subject: Re: CR for RFR 8151573

Hi Michael,

Changes are significant so they have to be justified, especially since we are in a later stage of jdk9 development. Do you have performance numbers (not only for microbenchmarks) which show the benefit of these changes?

Thanks,
Vladimir

On 3/15/16 2:04 PM, Berg, Michael C wrote:
> Hi Folks,
>
> I would like to contribute multi-versioning post loops for range check 
> elimination.  Beforehand cfg optimizations after register allocation 
> were where post loop optimizations were done for range checks.  I have 
> added code which produces the desired effect much earlier by 
> introducing a safe transformation which will minimally allow a range 
> check free version of the final post loop to execute up until the 
> point it actually has to take a range check exception by re-ranging 
> the limit of the rce'd loop, then exit the rce'd post loop and take 
> the range check exception in the legacy loop's execution if required.
> If during optimization we discover that we know enough to remove the 
> range check version of the post loop, mostly by exposing the load 
> range values into the limit logic of the rce'd post loop, we will 
> eliminate the range check post loop altogether much like cfg 
> optimizations did, but much earlier.  This gives optimizations like 
> programmable SIMD (via SuperWord) the opportunity to vectorize the 
> rce'd post loops to a single iteration based on mask vectors which map 
> to the residual iterations. Programmable SIMD will be a follow on 
> change set utilizing this code to stage its work. This optimization 
> also exposes the rce'd post loop without flow to other optimizations.
> Currently I have enabled this optimization for x86 only.  We base this 
> loop on successfully rce'd main loops and if, for whatever reason, 
> multiversioning fails, we eliminate the loop we added.
>
> This code was tested as follows:
>
>
> Bug-id: https://bugs.openjdk.java.net/browse/JDK-8151573
>
>
> webrev:
>
> http://cr.openjdk.java.net/~mcberg/8151573/webrev.01/
>
> Thanks,
>
> Michael
>
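
The multi-versioning idea in the quoted proposal can be pictured at source level roughly as follows. This is only an illustrative sketch, not the C2 implementation, which works on the compiler's internal graph rather than on Java source. The class name, method name, loop body and the assumption that start is non-negative are all hypothetical.

```java
final class MultiVersionedPostLoop {
    // Hypothetical source-level picture of a multi-versioned post loop.
    // Assumes start >= 0.
    static void postLoop(int[] d, int[] a, int start, int limit) {
        // Re-range the limit so the check-free version never leaves the arrays.
        int safeLimit = Math.min(limit, Math.min(d.length, a.length));
        int i = start;

        // Range-check-free ("rce'd") version: every index below safeLimit is in bounds.
        for (; i < safeLimit; i++) {
            d[i] = a[i] * 2;
        }

        // Legacy version: runs only if iterations remain, and takes the
        // range check exception at the first out-of-bounds index.
        for (; i < limit; i++) {
            d[i] = a[i] * 2;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] d = new int[8];
        postLoop(d, a, 5, 8);
        System.out.println(java.util.Arrays.toString(d));
    }
}
```

When limit is known not to exceed the array lengths, the second loop never executes, which corresponds to the case where the range check version of the post loop can be eliminated altogether.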