Detecting range check elimination with PrintAssembly

Mon Jan 16 16:15:18 PST 2012

Vladimir, thanks for the explanation and the code pointer.

Intuitively, it would seem like a good idea to trust the profile 100% if it
reports the same value used 100% of the time (I can see how anything less
than 100%, even a very high probability of same value, is not trustworthy)
given sufficient trips through the loop.  Although I can see how an app may
have phases where same value is seen for a while before it's switched, but
that's where I thought deopt would help.  There must be a good chunk of
code out there that doesn't know at static compilation time the loop count
(so can't use compile-time constant), but at runtime the actual value
doesn't change for many many trips through the loop; I know I have code
like that in various places.

What's the reason a compilation after deopt would not be as aggressive as
the 1st time? Is it because the profile information may be "weaker" (i.e.
more uncertainty in it)? I thought the profile is completely reset after
deopt, so I would think if the loop is now executed with a different
"constant" value (e.g. in our example, instead of 3 it's now 4), then the
same optimizations will be applied (of course if unrolling the loop is no
longer advantageous due to a much different value, I can see how different
optimizations will be applied).

Thanks

On Mon, Jan 16, 2012 at 6:45 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com
> wrote:

> Vitaly,
>
> We do use profile_trip_cnt during loop unroll calculation but not during
> fully unroll because we can't trust it 100% since program's phase and
> number of iterations could change after method is compiled. See
> policy_unroll() and policy_maximally_unroll():
>
> http://hg.openjdk.java.net/**hsx/hotspot-comp/hotspot/file/**
> 89d0a5d40008/src/share/vm/**opto/loopTransform.cpp<http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp>
>
> We could use deopt as you suggested but deoptimization is double-edge
> sword, when method recompiled after deoptimization some aggressive
> optimizations will not be executed for it so the new generated code could
> be slower.
>
> Regards,
> Vladimir
>
>
> On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
>
>> Hi Vladimir,
>>
>> If x_col is always seen to be same value in the profile shouldn't the
>> loop be unrolled as well with some deopt guard? Or
>> does this not participate in profiling?
>>
>> Thanks
>>
>> On Jan 16, 2012 4:57 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com<mailto:
>> vladimir.kozlov@**oracle.com <vladimir.kozlov at oracle.com>>> wrote:
>>
>>     > be different, because the expressions are similar. The difference
>> in runtime (due to cols being a compile-time
>>     > constant) will be visible elsewhere. Is that right? If so, where
>> would I be able to detect this?
>>
>>    In such situations we usually use some visual tools to see difference
>> between log outputs. At least you can use
>>    'diff'. You may need to replace instructions addresses in outputs
>> (number at the beginning of lines) with the same
>>    value to match. There are few tricks you may use to get similar
>> PrintOptoAssembly output. Use next flags to avoid
>>    mixing output from program output and from 2 compiler threads (flags
>> stop program until a method is compiled and run
>>    only one compiler thread):
>>
>>    -Xbatch -XX:CICompilerCount=1
>>
>>    Also add -XX:+PrintCompilation -XX:+PrintInlining to see what method
>> is compiled and inlined. Note that you may see
>>    similar output for individual methods but could be big difference in
>> compiled caller (computeAll()) method where 2
>>    loop methods could be inlined. So you need to compare all compiled
>> methods.
>>
>>    In general, to have constant as loop limit is always win because some
>> checks in generated code could be avoided and
>>    more optimizations could be done for such loops. Use
>> -XX:+TraceLoopOpts to see what loop optimizations are done in
>>    both cases.
>>
>>    For example, in your code you set 'x_col = 3', as result the next loop
>> in constructNearestClusterVector(**__) will be
>>
>>    fully unrolled when this method is inlined into computeAll() and x_col
>> is replaced with '3':
>>
>>            for(k = 0; k < x_col; k++) {
>>              double tmp = x[i*x_col + k] - mu[j* mu_col + k];
>>              dist += tmp * tmp;
>>            }
>>
>>    Vladimir
>>
>>    On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>>
>>        Hi Kris, Vladimir,
>>
>>        thanks for both your responses.
>>
>>            Second, your two test methods are different so you can't
>> directly compare them. method1() iterates over rows
>>        using
>>            middle loop index 'j' and method2() uses external loop index
>> 'i'. Unless they are typos again.
>>
>>
>>        You are right, these are indeed typos. As Kris suggested, I have
>> the code printed here:
>>        http://pastebin.com/xRFD1Nt1.
>>        The methods corresponding to method1, and method2 are
>> constructNearestClusterVector and computeNewCentroids. Their
>>        PrintOptoAssembly outputs are respectively at
>> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>>
>>        Also, it seems I have not explained myself correctly. I am not
>> trying to compare the performance of method1 with
>>        respect
>>        to that of method2: method1 and method2 both run in the same
>> program. What I am trying to compare is their
>>        performance
>>        in two cases:
>>        - when cols is a compile-time constant (much faster)
>>        - when cols is a value determined at run-time
>>
>>            If you are using jdk7 there are few flags you can use to print
>> loop optimizations information. They need debug
>>            version of VM but it is not problem for you, I think, since
>> you can use debug PrintOptoAssembly flag.
>>
>>            -XX:+TraceLoopOpts prints all happened loop optimizations and
>> loop tree after each round of loop opts,
>>            -XX:+TraceLoopPredicate prints RC information when it is moved
>> from a loop,
>>            -XX:+TraceRangeLimitCheck prints additional information for RC
>> elimination optimization.
>>
>>
>>        Thanks for these, I will have a look at what they output.
>>
>>            Fourth, range check expression in your example is not what you
>> think. RC expression should be next:
>>            (i*stride+offset) where 'i' is loop variable, 'stride' is
>> constant and 'offset' is loop invariant.
>>
>>            In your example 'offset' is (j * cols) since it is loop
>> invariant, 'k' is loop variable and stride is '1' (one).
>>            In both your methods RC will be moved out of inner loop so the
>> code for it will be the same. The only
>>        difference in
>>            these methods will be where and how (j * cols) and (i * cols)
>> expressions are calculated.
>>
>>
>>                I'd guess it's the difference in locality that made the
>> difference in performance in your two tests.
>>
>>          Thanks for the explanation. I understand from the above that the
>> assembly output in both cases mentioned above
>>        may not
>>        be different, because the expressions are similar. The difference
>> in runtime (due to cols being a compile-time
>>        constant)
>>        will be visible elsewhere. Is that right? If so, where would I be
>> able to detect this?
>>
>>        Cheers,
>>        Manohar
>>
>>                In your PrintOptoAssembly output snippet, the instruction
>> at 0x13e is a LoadRange, which loads the range
>>        from
>>                the header
>>                of an array:
>>
>>                (from x86_64.ad <http://x86_64.ad> <http://x86_64.ad> <
>> http://x86_64.ad>)
>>
>>
>>                // Load Range
>>                instruct loadRange(rRegI dst, memory mem)
>>                %{
>>                   match(Set dst (LoadRange mem));
>>
>>                   ins_cost(125); // XXX
>>                   format %{ "movl    $dst, $mem\t# range" %}
>>                   opcode(0x8B);
>>                   ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst,
>> mem));
>>                   ins_pipe(ialu_reg_mem);
>>                %}
>>
>>                That's not a range check just yet; the real check, if any,
>> should come after the null check, in the form of
>>                comparing
>>                something else with RSI. But you didn't show what's after
>> the null check, how RSI is used, so it's hard
>>        to say what
>>                you're seeing in your example.
>>
>>                As for the two test examples, could you paste the entire
>> source code, with the PrintOptoAssembly output of
>>                method1() and
>>                method2() ? The first example looks weird, maybe it's a
>> typo but you're using "j < cols" as the loop
>>        condition
>>                for the
>>                inner loop.
>>
>>
>>                - Kris
>>
>>                On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda <
>> manojo10386 at gmail.com
>>        <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com<mailto:
>> manojo10386 at gmail.com>**>
>>        <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>
>> <mailto:manojo10386 at gmail.com
>>        <mailto:manojo10386 at gmail.com>**>__>> wrote:
>>
>>                    Hello,
>>
>>                    following this reference on Range Check Elimination
>> done by the Hotspot compiler [1], I was keen in
>>        knowing
>>                how I
>>                    can detect whether range checks are taking place in
>> loops by inspecting output using the
>>        PrintAssembly flag;
>>                with
>>                    the old PrintOptoAssembly flag, I have seen output
>> such as the following, which I assume to be range
>>        checks :
>>
>>                    B11: #  B73 B12 &lt;- B10  Freq: 1.21365
>>                    139     movq    RAX, [rsp + #24]  # spill
>>                    13e     movl    RSI, [RAX + #12 (8-bit)]  # range
>>                    141     NullCheck RAX
>>
>>                    What is the equivalent with the new PrintAssembly flag
>> (using hsdis)?
>>
>>                    Moreover, as stated on the wiki page [1], loops are
>> optimized if the stride is a compile-time
>>        constant. I
>>                performed
>>                    a few tests on a kmeans program, with 3 nested loops,
>> having the following (high-level) structure:
>>
>>                    ===
>>                    void method1(){
>>                    //loop 1
>>                    for(int i = 0; i< rows1; i++){
>>                      //...
>>                       for(int j = 0; j< rows2; j++){
>>                       //...
>>                        for(int k = 0; j < cols; k++){ array[j * cols + k]
>> = //...}
>>                       }
>>                    }
>>                    }
>>
>>                    void method2(){
>>                    //loop 2
>>                    for(int i =0; i < rows1; i++){
>>                       for(int j=0 ; i< rows2; j++){
>>                         for(int k=0 ; k< cols; k++){
>>                           array[i*cols+k] = //...
>>                         }
>>                       }
>>                    }
>>                    }
>>
>>                    void main(){
>>
>>                      do{
>>                        method1(); method2();
>>                      }while(!converged)
>>
>>                    }
>>                    ====
>>
>>                    In the first test, cols is an int whose value is
>> determined at runtime (by reading a file), in the
>>        second
>>                test, it
>>                    is given as a compile-time constant(3). In the second
>> test, there is a */significant*/ speed-up
>>        (around 40%).
>>
>>                    However, when studying the diff of the output of
>> PrintOptoAssembly for both method1 and method2,
>>        there is no
>>                    difference (apart from slight value changes in
>> frequency). Would you have any hints as to where I
>>        could look for
>>                    differences?
>>
>>                    Thanks a lot,
>>                    Manohar
>>
>>                    [1] https://wikis.oracle.com/__**
>> display/HotSpotInternals/__**RangeCheckElimination<https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination>
>>
>>        <https://wikis.oracle.com/**display/HotSpotInternals/**
>> RangeCheckElimination<https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination>
>> >
>>
>>
>>
>>        As Kris pointed you need to fix your example:
>>
>>

-- 
Vitaly
617-548-7007 (mobile)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20120116/cb2bce5a/attachment-0001.html