Detecting range check elimination with PrintAssembly

Thu Jan 19 03:18:09 PST 2012

Hi Vladimir, Vitaly,

In such situations we usually use some visual tools to see difference
> between log outputs.

Would you recommend any? I have had troubles using IdealGraphVisualizer:
either the overhead of using it makes the methods not get JIT compiled, or,
if I decrese the CompileThreshold, the VM crashes with an out of memory
error.

> At least you can use 'diff'. You may need to replace instructions
> addresses in outputs (number at the beginning of lines) with the same value
> to match. There are few tricks you may use to get similar PrintOptoAssembly
> output. Use next flags to avoid mixing output from program output and from
> 2 compiler threads (flags stop program until a method is compiled and run
> only one compiler thread):
>
> -Xbatch -XX:CICompilerCount=1
>
> Also add -XX:+PrintCompilation -XX:+PrintInlining to see what method is
> compiled and inlined. Note that you may see similar output for individual
> methods but could be big difference in compiled caller (computeAll())
> method where 2 loop methods could be inlined. So you need to compare all
> compiled methods.
>

PrintInlining tells me that constructNearestClusterVector is inlined into
ComputeAll, and from what I understand, computeNewCentroids is not:

@ 109   Kmeans::computeNewCentroids (163 bytes)   already compiled into a
big method

I have this message 3 times, do these belong to the Pre-Main-Post of the
loop in the computeAll method?

> In general, to have constant as loop limit is always win because some
> checks in generated code could be avoided and more optimizations could be
> done for such loops. Use -XX:+TraceLoopOpts to see what loop optimizations
> are done in both cases.
>

I am new to this flag: should I use it in conjunction with the
PrintOptoAssembly output for identifying loop numbers with the code they
represent? Would there be any other way to interpret these?

Thanks,
Manohar

>
> For example, in your code you set 'x_col = 3', as result the next loop in
> constructNearestClusterVector() will be fully unrolled when this method is
> inlined into computeAll() and x_col is replaced with '3':
>
>        for(k = 0; k < x_col; k++) {
>          double tmp = x[i*x_col + k] - mu[j* mu_col + k];
>          dist += tmp * tmp;
>        }
>
> Vladimir
>
>
> On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>
>> Hi Kris, Vladimir,
>>
>> thanks for both your responses.
>>
>>    Second, your two test methods are different so you can't directly
>> compare them. method1() iterates over rows using
>>    middle loop index 'j' and method2() uses external loop index 'i'.
>> Unless they are typos again.
>>
>>
>> You are right, these are indeed typos. As Kris suggested, I have the code
>> printed here: http://pastebin.com/xRFD1Nt1.
>> The methods corresponding to method1, and method2 are
>> constructNearestClusterVector and computeNewCentroids. Their
>> PrintOptoAssembly outputs are respectively at
>> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>>
>> Also, it seems I have not explained myself correctly. I am not trying to
>> compare the performance of method1 with respect
>> to that of method2: method1 and method2 both run in the same program.
>> What I am trying to compare is their performance
>> in two cases:
>> - when cols is a compile-time constant (much faster)
>> - when cols is a value determined at run-time
>>
>>    If you are using jdk7 there are few flags you can use to print loop
>> optimizations information. They need debug
>>    version of VM but it is not problem for you, I think, since you can
>> use debug PrintOptoAssembly flag.
>>
>>    -XX:+TraceLoopOpts prints all happened loop optimizations and loop
>> tree after each round of loop opts,
>>    -XX:+TraceLoopPredicate prints RC information when it is moved from a
>> loop,
>>    -XX:+TraceRangeLimitCheck prints additional information for RC
>> elimination optimization.
>>
>>
>> Thanks for these, I will have a look at what they output.
>>
>>    Fourth, range check expression in your example is not what you think.
>> RC expression should be next:
>>    (i*stride+offset) where 'i' is loop variable, 'stride' is constant and
>> 'offset' is loop invariant.
>>
>>    In your example 'offset' is (j * cols) since it is loop invariant, 'k'
>> is loop variable and stride is '1' (one).
>>    In both your methods RC will be moved out of inner loop so the code
>> for it will be the same. The only difference in
>>    these methods will be where and how (j * cols) and (i * cols)
>> expressions are calculated.
>>
>>
>>        I'd guess it's the difference in locality that made the difference
>> in performance in your two tests.
>>
>>  Thanks for the explanation. I understand from the above that the
>> assembly output in both cases mentioned above may not
>> be different, because the expressions are similar. The difference in
>> runtime (due to cols being a compile-time constant)
>> will be visible elsewhere. Is that right? If so, where would I be able to
>> detect this?
>>
>> Cheers,
>> Manohar
>>
>>        In your PrintOptoAssembly output snippet, the instruction at 0x13e
>> is a LoadRange, which loads the range from
>>        the header
>>        of an array:
>>
>>        (from x86_64.ad <http://x86_64.ad> <http://x86_64.ad>)
>>
>>
>>        // Load Range
>>        instruct loadRange(rRegI dst, memory mem)
>>        %{
>>           match(Set dst (LoadRange mem));
>>
>>           ins_cost(125); // XXX
>>           format %{ "movl    $dst, $mem\t# range" %}
>>           opcode(0x8B);
>>           ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst, mem));
>>           ins_pipe(ialu_reg_mem);
>>        %}
>>
>>        That's not a range check just yet; the real check, if any, should
>> come after the null check, in the form of
>>        comparing
>>        something else with RSI. But you didn't show what's after the null
>> check, how RSI is used, so it's hard to say what
>>        you're seeing in your example.
>>
>>        As for the two test examples, could you paste the entire source
>> code, with the PrintOptoAssembly output of
>>        method1() and
>>        method2() ? The first example looks weird, maybe it's a typo but
>> you're using "j < cols" as the loop condition
>>        for the
>>        inner loop.
>>
>>
>>        - Kris
>>
>>        On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda <
>> manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>
>>        <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>>>
>> wrote:
>>
>>            Hello,
>>
>>            following this reference on Range Check Elimination done by
>> the Hotspot compiler [1], I was keen in knowing
>>        how I
>>            can detect whether range checks are taking place in loops by
>> inspecting output using the PrintAssembly flag;
>>        with
>>            the old PrintOptoAssembly flag, I have seen output such as the
>> following, which I assume to be range checks :
>>
>>            B11: #  B73 B12 &lt;- B10  Freq: 1.21365
>>            139     movq    RAX, [rsp + #24]  # spill
>>            13e     movl    RSI, [RAX + #12 (8-bit)]  # range
>>            141     NullCheck RAX
>>
>>            What is the equivalent with the new PrintAssembly flag (using
>> hsdis)?
>>
>>            Moreover, as stated on the wiki page [1], loops are optimized
>> if the stride is a compile-time constant. I
>>        performed
>>            a few tests on a kmeans program, with 3 nested loops, having
>> the following (high-level) structure:
>>
>>            ===
>>            void method1(){
>>            //loop 1
>>            for(int i = 0; i< rows1; i++){
>>              //...
>>               for(int j = 0; j< rows2; j++){
>>               //...
>>                for(int k = 0; j < cols; k++){ array[j * cols + k] = //...}
>>               }
>>            }
>>            }
>>
>>            void method2(){
>>            //loop 2
>>            for(int i =0; i < rows1; i++){
>>               for(int j=0 ; i< rows2; j++){
>>                 for(int k=0 ; k< cols; k++){
>>                   array[i*cols+k] = //...
>>                 }
>>               }
>>            }
>>            }
>>
>>            void main(){
>>
>>              do{
>>                method1(); method2();
>>              }while(!converged)
>>
>>            }
>>            ====
>>
>>            In the first test, cols is an int whose value is determined at
>> runtime (by reading a file), in the second
>>        test, it
>>            is given as a compile-time constant(3). In the second test,
>> there is a */significant*/ speed-up (around 40%).
>>
>>            However, when studying the diff of the output of
>> PrintOptoAssembly for both method1 and method2, there is no
>>            difference (apart from slight value changes in frequency).
>> Would you have any hints as to where I could look for
>>            differences?
>>
>>            Thanks a lot,
>>            Manohar
>>
>>            [1]
>> https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
>>
>>
>>
>> As Kris pointed you need to fix your example:
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20120119/2d6a78c8/attachment.html