Detecting range check elimination with PrintAssembly

Vladimir Kozlov vladimir.kozlov at oracle.com
Thu Jan 19 10:07:12 PST 2012


Manohar Jonnalagedda wrote:
> Hi Vladimir, Vitaly,
> 
>     In such situations we usually use some visual tools to see
>     difference between log outputs. 
> 
> 
> Would you recommend any? I have had troubles using IdealGraphVisualizer: 
> either the overhead of using it makes the methods not get JIT compiled, 
> or, if I decrese the CompileThreshold, the VM crashes with an out of 
> memory error.

PrintOptoAssembly output is simple text. You can use any text compare tools 
available on net (google it). IdealGraphVisualizer is C2 tool to trace C2 
optimization, you don't need it to compare text outputs.

> 
>     At least you can use 'diff'. You may need to replace instructions
>     addresses in outputs (number at the beginning of lines) with the
>     same value to match. There are few tricks you may use to get similar
>     PrintOptoAssembly output. Use next flags to avoid mixing output from
>     program output and from 2 compiler threads (flags stop program until
>     a method is compiled and run only one compiler thread):
> 
>     -Xbatch -XX:CICompilerCount=1
> 
>     Also add -XX:+PrintCompilation -XX:+PrintInlining to see what method
>     is compiled and inlined. Note that you may see similar output for
>     individual methods but could be big difference in compiled caller
>     (computeAll()) method where 2 loop methods could be inlined. So you
>     need to compare all compiled methods.
> 
> 
> PrintInlining tells me that constructNearestClusterVector is inlined 
> into ComputeAll, and from what I understand, computeNewCentroids is not:
> 
> @ 109   Kmeans::computeNewCentroids (163 bytes)   already compiled into 
> a big method

Correct, it is big method and C2 decide to not inline it. You can change it by 
increasing InlineSmallCode value, for example -XX:InlineSmallCode=1500.

> 
> I have this message 3 times, do these belong to the Pre-Main-Post of the 
> loop in the computeAll method?
>  
> 
>     In general, to have constant as loop limit is always win because
>     some checks in generated code could be avoided and more
>     optimizations could be done for such loops. Use -XX:+TraceLoopOpts
>     to see what loop optimizations are done in both cases.
> 
> I am new to this flag: should I use it in conjunction with the 
> PrintOptoAssembly output for identifying loop numbers with the code they 
> represent? Would there be any other way to interpret these?

I suggested it to see what different loop optimizations are performed when a 
constant is used vs not. For example, if you get "MaxUnroll" instead of 
"PreMainPost" the generated code should be better.

Vladimir

> 
> 
> Thanks,
> Manohar
>  
> 
> 
>     For example, in your code you set 'x_col = 3', as result the next
>     loop in constructNearestClusterVector() will be fully unrolled when
>     this method is inlined into computeAll() and x_col is replaced with '3':
> 
>            for(k = 0; k < x_col; k++) {
>              double tmp = x[i*x_col + k] - mu[j* mu_col + k];
>              dist += tmp * tmp;
>            }
> 
>     Vladimir
> 
> 
>     On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
> 
>         Hi Kris, Vladimir,
> 
>         thanks for both your responses.
> 
>            Second, your two test methods are different so you can't
>         directly compare them. method1() iterates over rows using
>            middle loop index 'j' and method2() uses external loop index
>         'i'. Unless they are typos again.
> 
> 
>         You are right, these are indeed typos. As Kris suggested, I have
>         the code printed here: http://pastebin.com/xRFD1Nt1.
>         The methods corresponding to method1, and method2 are
>         constructNearestClusterVector and computeNewCentroids. Their
>         PrintOptoAssembly outputs are respectively at
>         http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
> 
>         Also, it seems I have not explained myself correctly. I am not
>         trying to compare the performance of method1 with respect
>         to that of method2: method1 and method2 both run in the same
>         program. What I am trying to compare is their performance
>         in two cases:
>         - when cols is a compile-time constant (much faster)
>         - when cols is a value determined at run-time
> 
>            If you are using jdk7 there are few flags you can use to
>         print loop optimizations information. They need debug
>            version of VM but it is not problem for you, I think, since
>         you can use debug PrintOptoAssembly flag.
> 
>            -XX:+TraceLoopOpts prints all happened loop optimizations and
>         loop tree after each round of loop opts,
>            -XX:+TraceLoopPredicate prints RC information when it is
>         moved from a loop,
>            -XX:+TraceRangeLimitCheck prints additional information for
>         RC elimination optimization.
> 
> 
>         Thanks for these, I will have a look at what they output.
> 
>            Fourth, range check expression in your example is not what
>         you think. RC expression should be next:
>            (i*stride+offset) where 'i' is loop variable, 'stride' is
>         constant and 'offset' is loop invariant.
> 
>            In your example 'offset' is (j * cols) since it is loop
>         invariant, 'k' is loop variable and stride is '1' (one).
>            In both your methods RC will be moved out of inner loop so
>         the code for it will be the same. The only difference in
>            these methods will be where and how (j * cols) and (i * cols)
>         expressions are calculated.
> 
> 
>                I'd guess it's the difference in locality that made the
>         difference in performance in your two tests.
> 
>          Thanks for the explanation. I understand from the above that
>         the assembly output in both cases mentioned above may not
>         be different, because the expressions are similar. The
>         difference in runtime (due to cols being a compile-time constant)
>         will be visible elsewhere. Is that right? If so, where would I
>         be able to detect this?
> 
>         Cheers,
>         Manohar
> 
>                In your PrintOptoAssembly output snippet, the instruction
>         at 0x13e is a LoadRange, which loads the range from
>                the header
>                of an array:
> 
>                (from x86_64.ad <http://x86_64.ad> <http://x86_64.ad>
>         <http://x86_64.ad>)
> 
> 
>                // Load Range
>                instruct loadRange(rRegI dst, memory mem)
>                %{
>                   match(Set dst (LoadRange mem));
> 
>                   ins_cost(125); // XXX
>                   format %{ "movl    $dst, $mem\t# range" %}
>                   opcode(0x8B);
>                   ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst,
>         mem));
>                   ins_pipe(ialu_reg_mem);
>                %}
> 
>                That's not a range check just yet; the real check, if
>         any, should come after the null check, in the form of
>                comparing
>                something else with RSI. But you didn't show what's after
>         the null check, how RSI is used, so it's hard to say what
>                you're seeing in your example.
> 
>                As for the two test examples, could you paste the entire
>         source code, with the PrintOptoAssembly output of
>                method1() and
>                method2() ? The first example looks weird, maybe it's a
>         typo but you're using "j < cols" as the loop condition
>                for the
>                inner loop.
> 
> 
>                - Kris
> 
>                On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda
>         <manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>
>         <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>>
>                <mailto:manojo10386 at gmail.com
>         <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>         <mailto:manojo10386 at gmail.com>>>> wrote:
> 
>                    Hello,
> 
>                    following this reference on Range Check Elimination
>         done by the Hotspot compiler [1], I was keen in knowing
>                how I
>                    can detect whether range checks are taking place in
>         loops by inspecting output using the PrintAssembly flag;
>                with
>                    the old PrintOptoAssembly flag, I have seen output
>         such as the following, which I assume to be range checks :
> 
>                    B11: #  B73 B12 &lt;- B10  Freq: 1.21365
>                    139     movq    RAX, [rsp + #24]  # spill
>                    13e     movl    RSI, [RAX + #12 (8-bit)]  # range
>                    141     NullCheck RAX
> 
>                    What is the equivalent with the new PrintAssembly
>         flag (using hsdis)?
> 
>                    Moreover, as stated on the wiki page [1], loops are
>         optimized if the stride is a compile-time constant. I
>                performed
>                    a few tests on a kmeans program, with 3 nested loops,
>         having the following (high-level) structure:
> 
>                    ===
>                    void method1(){
>                    //loop 1
>                    for(int i = 0; i< rows1; i++){
>                      //...
>                       for(int j = 0; j< rows2; j++){
>                       //...
>                        for(int k = 0; j < cols; k++){ array[j * cols +
>         k] = //...}
>                       }
>                    }
>                    }
> 
>                    void method2(){
>                    //loop 2
>                    for(int i =0; i < rows1; i++){
>                       for(int j=0 ; i< rows2; j++){
>                         for(int k=0 ; k< cols; k++){
>                           array[i*cols+k] = //...
>                         }
>                       }
>                    }
>                    }
> 
>                    void main(){
> 
>                      do{
>                        method1(); method2();
>                      }while(!converged)
> 
>                    }
>                    ====
> 
>                    In the first test, cols is an int whose value is
>         determined at runtime (by reading a file), in the second
>                test, it
>                    is given as a compile-time constant(3). In the second
>         test, there is a */significant*/ speed-up (around 40%).
> 
>                    However, when studying the diff of the output of
>         PrintOptoAssembly for both method1 and method2, there is no
>                    difference (apart from slight value changes in
>         frequency). Would you have any hints as to where I could look for
>                    differences?
> 
>                    Thanks a lot,
>                    Manohar
> 
>                    [1]
>         https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
> 
> 
> 
>         As Kris pointed you need to fix your example:
> 
> 
> 


More information about the hotspot-dev mailing list