Detecting range check elimination with PrintAssembly

Mon Jan 16 15:45:14 PST 2012

Vitaly,

We do use profile_trip_cnt during loop unroll calculation but not during fully unroll because we can't trust it 100% 
since program's phase and number of iterations could change after method is compiled. See policy_unroll() and 
policy_maximally_unroll():

http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp

We could use deopt as you suggested but deoptimization is double-edge sword, when method recompiled after deoptimization 
some aggressive optimizations will not be executed for it so the new generated code could be slower.

Regards,
Vladimir

On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
> Hi Vladimir,
>
> If x_col is always seen to be same value in the profile shouldn't the loop be unrolled as well with some deopt guard? Or
> does this not participate in profiling?
>
> Thanks
>
> On Jan 16, 2012 4:57 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
>
>      > be different, because the expressions are similar. The difference in runtime (due to cols being a compile-time
>      > constant) will be visible elsewhere. Is that right? If so, where would I be able to detect this?
>
>     In such situations we usually use some visual tools to see difference between log outputs. At least you can use
>     'diff'. You may need to replace instructions addresses in outputs (number at the beginning of lines) with the same
>     value to match. There are few tricks you may use to get similar PrintOptoAssembly output. Use next flags to avoid
>     mixing output from program output and from 2 compiler threads (flags stop program until a method is compiled and run
>     only one compiler thread):
>
>     -Xbatch -XX:CICompilerCount=1
>
>     Also add -XX:+PrintCompilation -XX:+PrintInlining to see what method is compiled and inlined. Note that you may see
>     similar output for individual methods but could be big difference in compiled caller (computeAll()) method where 2
>     loop methods could be inlined. So you need to compare all compiled methods.
>
>     In general, to have constant as loop limit is always win because some checks in generated code could be avoided and
>     more optimizations could be done for such loops. Use -XX:+TraceLoopOpts to see what loop optimizations are done in
>     both cases.
>
>     For example, in your code you set 'x_col = 3', as result the next loop in constructNearestClusterVector(__) will be
>     fully unrolled when this method is inlined into computeAll() and x_col is replaced with '3':
>
>             for(k = 0; k < x_col; k++) {
>               double tmp = x[i*x_col + k] - mu[j* mu_col + k];
>               dist += tmp * tmp;
>             }
>
>     Vladimir
>
>     On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>
>         Hi Kris, Vladimir,
>
>         thanks for both your responses.
>
>             Second, your two test methods are different so you can't directly compare them. method1() iterates over rows
>         using
>             middle loop index 'j' and method2() uses external loop index 'i'. Unless they are typos again.
>
>
>         You are right, these are indeed typos. As Kris suggested, I have the code printed here:
>         http://pastebin.com/xRFD1Nt1.
>         The methods corresponding to method1, and method2 are constructNearestClusterVector and computeNewCentroids. Their
>         PrintOptoAssembly outputs are respectively at http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>
>         Also, it seems I have not explained myself correctly. I am not trying to compare the performance of method1 with
>         respect
>         to that of method2: method1 and method2 both run in the same program. What I am trying to compare is their
>         performance
>         in two cases:
>         - when cols is a compile-time constant (much faster)
>         - when cols is a value determined at run-time
>
>             If you are using jdk7 there are few flags you can use to print loop optimizations information. They need debug
>             version of VM but it is not problem for you, I think, since you can use debug PrintOptoAssembly flag.
>
>             -XX:+TraceLoopOpts prints all happened loop optimizations and loop tree after each round of loop opts,
>             -XX:+TraceLoopPredicate prints RC information when it is moved from a loop,
>             -XX:+TraceRangeLimitCheck prints additional information for RC elimination optimization.
>
>
>         Thanks for these, I will have a look at what they output.
>
>             Fourth, range check expression in your example is not what you think. RC expression should be next:
>             (i*stride+offset) where 'i' is loop variable, 'stride' is constant and 'offset' is loop invariant.
>
>             In your example 'offset' is (j * cols) since it is loop invariant, 'k' is loop variable and stride is '1' (one).
>             In both your methods RC will be moved out of inner loop so the code for it will be the same. The only
>         difference in
>             these methods will be where and how (j * cols) and (i * cols) expressions are calculated.
>
>
>                 I'd guess it's the difference in locality that made the difference in performance in your two tests.
>
>           Thanks for the explanation. I understand from the above that the assembly output in both cases mentioned above
>         may not
>         be different, because the expressions are similar. The difference in runtime (due to cols being a compile-time
>         constant)
>         will be visible elsewhere. Is that right? If so, where would I be able to detect this?
>
>         Cheers,
>         Manohar
>
>                 In your PrintOptoAssembly output snippet, the instruction at 0x13e is a LoadRange, which loads the range
>         from
>                 the header
>                 of an array:
>
>                 (from x86_64.ad <http://x86_64.ad> <http://x86_64.ad> <http://x86_64.ad>)
>
>                 // Load Range
>                 instruct loadRange(rRegI dst, memory mem)
>                 %{
>                    match(Set dst (LoadRange mem));
>
>                    ins_cost(125); // XXX
>                    format %{ "movl    $dst, $mem\t# range" %}
>                    opcode(0x8B);
>                    ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst, mem));
>                    ins_pipe(ialu_reg_mem);
>                 %}
>
>                 That's not a range check just yet; the real check, if any, should come after the null check, in the form of
>                 comparing
>                 something else with RSI. But you didn't show what's after the null check, how RSI is used, so it's hard
>         to say what
>                 you're seeing in your example.
>
>                 As for the two test examples, could you paste the entire source code, with the PrintOptoAssembly output of
>                 method1() and
>                 method2() ? The first example looks weird, maybe it's a typo but you're using "j < cols" as the loop
>         condition
>                 for the
>                 inner loop.
>
>
>                 - Kris
>
>                 On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda <manojo10386 at gmail.com
>         <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>>
>         <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>         <mailto:manojo10386 at gmail.com>>__>> wrote:
>
>                     Hello,
>
>                     following this reference on Range Check Elimination done by the Hotspot compiler [1], I was keen in
>         knowing
>                 how I
>                     can detect whether range checks are taking place in loops by inspecting output using the
>         PrintAssembly flag;
>                 with
>                     the old PrintOptoAssembly flag, I have seen output such as the following, which I assume to be range
>         checks :
>
>                     B11: #  B73 B12 &lt;- B10  Freq: 1.21365
>                     139     movq    RAX, [rsp + #24]  # spill
>                     13e     movl    RSI, [RAX + #12 (8-bit)]  # range
>                     141     NullCheck RAX
>
>                     What is the equivalent with the new PrintAssembly flag (using hsdis)?
>
>                     Moreover, as stated on the wiki page [1], loops are optimized if the stride is a compile-time
>         constant. I
>                 performed
>                     a few tests on a kmeans program, with 3 nested loops, having the following (high-level) structure:
>
>                     ===
>                     void method1(){
>                     //loop 1
>                     for(int i = 0; i< rows1; i++){
>                       //...
>                        for(int j = 0; j< rows2; j++){
>                        //...
>                         for(int k = 0; j < cols; k++){ array[j * cols + k] = //...}
>                        }
>                     }
>                     }
>
>                     void method2(){
>                     //loop 2
>                     for(int i =0; i < rows1; i++){
>                        for(int j=0 ; i< rows2; j++){
>                          for(int k=0 ; k< cols; k++){
>                            array[i*cols+k] = //...
>                          }
>                        }
>                     }
>                     }
>
>                     void main(){
>
>                       do{
>                         method1(); method2();
>                       }while(!converged)
>
>                     }
>                     ====
>
>                     In the first test, cols is an int whose value is determined at runtime (by reading a file), in the
>         second
>                 test, it
>                     is given as a compile-time constant(3). In the second test, there is a */significant*/ speed-up
>         (around 40%).
>
>                     However, when studying the diff of the output of PrintOptoAssembly for both method1 and method2,
>         there is no
>                     difference (apart from slight value changes in frequency). Would you have any hints as to where I
>         could look for
>                     differences?
>
>                     Thanks a lot,
>                     Manohar
>
>                     [1] https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination
>         <https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination>
>
>
>
>         As Kris pointed you need to fix your example:
>