Detecting range check elimination with PrintAssembly

Vitaly Davidovich vitalyd at gmail.com
Tue Jan 17 10:10:05 PST 2012


Awesome info - thanks very much for taking the time to answer.

Sent from my phone
On Jan 17, 2012 1:02 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com>
wrote:

> Vitaly Davidovich wrote:
>
>> Vladimir, thanks for that -- very useful and interesting to know.  So I
>> take it a compilation after the initial deopt may be less aggressive than
>> the first time around simply because the accumulated MD might be in a state
>> such that it guides optimization in a different direction, is that right?
>>
>>
>
> Yes, that is right, especially for inlining of dynamic calls based on
> target profiling, when after some time another method starts being called.
> We also have a limit on how many times a method can be recompiled, after
> which the method will only be interpreted. We also have a limit on how many
> uncommon traps (deoptimizations) can happen in a method, after which we
> stop generating some optimizations which use uncommon traps. Usually that
> means generating slow paths in compiled code which we thought (based on the
> original profiling) should not be executed (so we had generated an uncommon
> trap instead). As a result the compiled code size will change, and inlining
> will too. Note, this happens not immediately but only after a significant
> number of deoptimizations: PerMethodTrapLimit is 100 and
> PerMethodRecompilationCutoff is 400.
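>
> For reference, you can check these limits on your own VM with
> -XX:+PrintFlagsFinal (whether a given flag is printed can vary by build
> and release):
>
>   java -XX:+PrintFlagsFinal -version | grep PerMethod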
>
>
>> I guess my question is: how does one reason about the generated assembly
>> when it's so (possibly) dynamic? :) I've looked at the assembly in a
>> product hotspot via hsdis, but then you realize that you may be looking at
>> just one compilation of it, and it may be different at some point later.
>> Also, for something like Intel's VTune (or Amplifier) that supports showing
>> JIT'd assembly, I guess it's the same issue? This is quite different, of
>> course, from .NET CLR's JIT, because there's no profiling there -- the 1st
>> time a method is hit, it gets JIT'd, and there's no code pitching (which
>> makes looking at the assembly seem a bit more "reassuring").
>>
>
> VTune shows the several compiled versions of a method. We have our own
> profiling tool on Solaris (Analyzer) which does the same. Note that
> deoptimization and recompilation usually happen only during startup, so
> only the latest compiled version is used in the long run.
>
>
>> In general, would you say that the C2 compiler favors "over
>> aggressiveness" with deopt guards? Or does it try to avoid deopts and only
>> perform aggressive optimizations where the profile is quite conclusive on
>> type and/or counters?
>>
>
> C2 compiles as aggressively as possible. It becomes less aggressive only
> during later recompilations.
>
> Regards,
> Vladimir
>
>
>> Thanks
>>
>> On Mon, Jan 16, 2012 at 11:52 PM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com> wrote:
>>
>>    I need to be more clear. We can't get 100% correct profiling results
>>    for counters. First, the profiling structure (MethodData) is created
>>    only after a method has executed for some time (for the Server
>>    compiler it is 3000 invocations). This is done to avoid profiling
>>    short programs, since profiling slows down the Interpreter by about
>>    20%. This delayed profiling introduces the first discrepancy between
>>    counters. MD can also be created when a method has a hot loop (a lot
>>    of iterations) and an OSR compilation is requested. As a result,
>>    counters in the loop will not correlate with counters before the
>>    loop. Another reason for imprecise profiling counters is execution
>>    of a method by multiple threads. We have only one MD structure per
>>    method, which is updated (not atomically, for speed) by all such
>>    threads. And when a method is prepared for compilation, the compiler
>>    thread creates a snapshot of the MD, which can happen in the middle
>>    of the method's execution by Java threads.
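>>
>>    For illustration, a minimal made-up sketch of the two shapes (the
>>    method names are hypothetical, not from your code): inc() gets hot
>>    through its invocation counter, while osr() is entered only once but
>>    its loop's back-edge counter triggers an OSR compilation, so its MD
>>    is created on a different schedule:
>>
>>        static int inc(int v) { return v + 1; }  // hot by invocation count
>>
>>        static long osr(int n) {
>>            long sum = 0;
>>            for (int i = 0; i < n; i++) {  // hot by back-edge count -> OSR
>>                sum += inc(i);
>>            }
>>            return sum;
>>        }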
>>
>>    We don't reset MD during deoptimization because we still need the
>>    previous values (in addition to counters we collect type
>>    information). So counters are accumulated. And profiling resumes
>>    from the deoptimization point, which again adds discrepancy.
>>
>>    I hope this explains why we can't trust profiling counters 100%.
>>
>>    Regards,
>>    Vladimir
>>
>>
>>    On 1/16/12 4:15 PM, Vitaly Davidovich wrote:
>>
>>        Vladimir, thanks for the explanation and the code pointer.
>>
>>        Intuitively, it would seem like a good idea to trust the profile
>>        100% if it reports the same value used 100% of the time
>>        (I can see how anything less than 100%, even a very high
>>        probability of the same value, is not trustworthy) given
>>        sufficient trips through the loop.  Although I can see how an
>>        app may have phases where the same value is seen for a while
>>        before it switches, that's where I thought deopt would help.
>>        There must be a good chunk of code out there that doesn't know
>>        the loop count at static compilation time (so can't use a
>>        compile-time constant), but at runtime the actual value doesn't
>>        change for many, many trips through the loop; I know I have code
>>        like that in various places.
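>>
>>        Something like this, as a made-up sketch of that pattern (cols
>>        is read once at startup and never changes afterwards, yet is not
>>        a compile-time constant):
>>
>>            int cols;  // read from a config file at startup, then fixed
>>
>>            double rowSum(double[] row) {
>>                double s = 0;
>>                for (int k = 0; k < cols; k++) {  // bound constant in practice
>>                    s += row[k];
>>                }
>>                return s;
>>            }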
>>
>>        What's the reason a compilation after deopt would not be as
>>        aggressive as the 1st time? Is it because the profile
>>        information may be "weaker" (i.e. more uncertainty in it)? I
>>        thought the profile is completely reset after deopt, so I
>>        would think if the loop is now executed with a different
>>        "constant" value (e.g. in our example, instead of 3 it's now
>>        4), then the same optimizations will be applied (of course if
>>        unrolling the loop is no longer advantageous due to a much
>>        different value, I can see how different optimizations will be
>>        applied).
>>
>>        Thanks
>>
>>        On Mon, Jan 16, 2012 at 6:45 PM, Vladimir Kozlov
>>        <vladimir.kozlov at oracle.com> wrote:
>>
>>           Vitaly,
>>
>>           We do use profile_trip_cnt during the loop unroll
>>           calculation, but not for full unrolling, because we can't
>>           trust it 100%: the program's phase and the number of
>>           iterations can change after the method is compiled. See
>>           policy_unroll() and policy_maximally_unroll():
>>
>>           http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp
>>
>>           We could use deopt as you suggested, but deoptimization is a
>>           double-edged sword: when a method is recompiled after
>>           deoptimization, some aggressive optimizations will not be
>>           applied to it, so the newly generated code could be slower.
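>>
>>           Conceptually, such a guard would look like the following
>>           hand-written sketch (the shape of the speculation, not actual
>>           C2 output):
>>
>>               if (x_col != 3) {
>>                   // uncommon trap: deoptimize to the interpreter,
>>                   // record the trap in the MethodData, and possibly
>>                   // recompile later without this speculation
>>               } else {
>>                   // loop body fully unrolled for x_col == 3
>>               }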
>>
>>           Regards,
>>           Vladimir
>>
>>
>>           On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
>>
>>               Hi Vladimir,
>>
>>               If x_col is always seen to be the same value in the
>>               profile, shouldn't the loop be unrolled as well, with
>>               some deopt guard? Or does this not participate in
>>               profiling?
>>
>>               Thanks
>>
>>               On Jan 16, 2012 4:57 PM, "Vladimir Kozlov"
>>               <vladimir.kozlov at oracle.com> wrote:
>>
>>                > be different, because the expressions are similar. The
>>        difference in runtime (due to cols being a compile-time
>>                > constant) will be visible elsewhere. Is that right? If
>>        so, where would I be able to detect this?
>>
>>                   In such situations we usually use some visual tool to
>>                   see the difference between log outputs; at the least
>>                   you can use 'diff'. You may need to replace the
>>                   instruction addresses in the outputs (the number at
>>                   the beginning of each line) with the same value so
>>                   that lines match. There are a few tricks you can use
>>                   to get similar PrintOptoAssembly output. Use the next
>>                   flags to avoid mixing the program's own output with
>>                   output from 2 compiler threads (the flags stop the
>>                   program until a method is compiled and run only one
>>                   compiler thread):
>>
>>                   -Xbatch -XX:CICompilerCount=1
>>
>>                   Also add -XX:+PrintCompilation -XX:+PrintInlining to
>>                   see which methods are compiled and inlined. Note that
>>                   you may see similar output for the individual methods
>>                   but a big difference in the compiled caller
>>                   (computeAll()) method, into which the 2 loop methods
>>                   could be inlined. So you need to compare all compiled
>>                   methods.
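>>
>>                   Putting it together, an invocation might look like
>>                   the following (the class name here is made up, and
>>                   PrintOptoAssembly needs a debug build):
>>
>>                   java -Xbatch -XX:CICompilerCount=1 \
>>                        -XX:+PrintCompilation -XX:+PrintInlining \
>>                        -XX:+PrintOptoAssembly KMeans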
>>
>>                   In general, having a constant as the loop limit is
>>                   always a win, because some checks in the generated
>>                   code can be avoided and more optimizations can be
>>                   done for such loops. Use -XX:+TraceLoopOpts to see
>>                   what loop optimizations are done in both cases.
>>
>>                   For example, in your code you set 'x_col = 3'; as a
>>                   result the next loop in constructNearestClusterVector()
>>                   will be fully unrolled when this method is inlined
>>                   into computeAll() and x_col is replaced with '3':
>>
>>                           for(k = 0; k < x_col; k++) {
>>                             double tmp = x[i*x_col + k] - mu[j*mu_col + k];
>>                             dist += tmp * tmp;
>>                           }
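>>
>>                   Roughly, the fully unrolled form is equivalent to the
>>                   following (a sketch of the effect, not the literal
>>                   compiled code):
>>
>>                           double t0 = x[i*3 + 0] - mu[j*mu_col + 0];
>>                           double t1 = x[i*3 + 1] - mu[j*mu_col + 1];
>>                           double t2 = x[i*3 + 2] - mu[j*mu_col + 2];
>>                           dist += t0*t0 + t1*t1 + t2*t2;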
>>
>>                   Vladimir
>>
>>                   On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>>
>>                       Hi Kris, Vladimir,
>>
>>                       thanks for both your responses.
>>
>>                           Second, your two test methods are different,
>>                           so you can't directly compare them. method1()
>>                           iterates over rows using the middle loop
>>                           index 'j' and method2() uses the external
>>                           loop index 'i'. Unless they are typos again.
>>
>>
>>                       You are right, these are indeed typos. As Kris
>>                       suggested, I have the code printed here:
>>                       http://pastebin.com/xRFD1Nt1. The methods
>>                       corresponding to method1 and method2 are
>>                       constructNearestClusterVector and
>>                       computeNewCentroids. Their PrintOptoAssembly
>>                       outputs are respectively at
>>                       http://pastebin.com/1evN8b3K and
>>                       http://pastebin.com/FxkVWTD5
>>
>>                       Also, it seems I have not explained myself
>>                       correctly. I am not trying to compare the
>>                       performance of method1 with respect to that of
>>                       method2: method1 and method2 both run in the same
>>                       program. What I am trying to compare is their
>>                       performance in two cases:
>>                       - when cols is a compile-time constant (much faster)
>>                       - when cols is a value determined at run-time
>>
>>                           If you are using jdk7 there are a few flags
>>                           you can use to print loop optimization
>>                           information. They need a debug version of the
>>                           VM, but that is not a problem for you, I
>>                           think, since you can use the debug
>>                           PrintOptoAssembly flag.
>>
>>                           -XX:+TraceLoopOpts prints all the loop
>>                           optimizations that happened and the loop tree
>>                           after each round of loop opts,
>>                           -XX:+TraceLoopPredicate prints RC information
>>                           when it is moved out of a loop,
>>                           -XX:+TraceRangeLimitCheck prints additional
>>                           information for the RC elimination
>>                           optimization.
>>
>>
>>                       Thanks for these, I will have a look at what they
>>        output.
>>
>>                           Fourth, the range check expression in your
>>                           example is not what you think. An RC
>>                           expression has the form (i*stride+offset),
>>                           where 'i' is the loop variable, 'stride' is a
>>                           constant and 'offset' is loop invariant.
>>
>>                           In your example 'offset' is (j * cols) since
>>                           it is loop invariant, 'k' is the loop
>>                           variable and the stride is '1' (one). In both
>>                           your methods the RC will be moved out of the
>>                           inner loop, so the code for it will be the
>>                           same. The only difference between these
>>                           methods will be where and how the (j * cols)
>>                           and (i * cols) expressions are calculated.
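>>
>>                           Mapping that onto your inner loop, a sketch:
>>                           the access index is k*1 + (j*cols), so the
>>                           bounds test can be hoisted out of the loop:
>>
>>                               int offset = j * cols;       // loop invariant
>>                               for (int k = 0; k < cols; k++) {
>>                                   // index = k*1 + offset; RC hoisted
>>                                   array[offset + k] = 0.0;
>>                               }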
>>
>>
>>                               I'd guess it's the difference in locality
>>        that made the difference in performance in your two tests.
>>
>>                         Thanks for the explanation. I understand from
>>                         the above that the assembly output in the two
>>                         cases may not be different, because the
>>                         expressions are similar. The difference in
>>                         runtime (due to cols being a compile-time
>>                         constant) will be visible elsewhere. Is that
>>                         right? If so, where would I be able to detect
>>                         this?
>>
>>                       Cheers,
>>                       Manohar
>>
>>                               In your PrintOptoAssembly output snippet,
>>                               the instruction at 0x13e is a LoadRange,
>>                               which loads the range from the header of
>>                               an array:
>>
>>                               (from x86_64.ad)
>>
>>
>>
>>                               // Load Range
>>                               instruct loadRange(rRegI dst, memory mem)
>>                               %{
>>                                  match(Set dst (LoadRange mem));
>>
>>                                  ins_cost(125); // XXX
>>                                  format %{ "movl    $dst, $mem\t# range"
>> %}
>>                                  opcode(0x8B);
>>                                  ins_encode(REX_reg_mem(dst, mem),
>>        OpcP, reg_mem(dst, mem));
>>                                  ins_pipe(ialu_reg_mem);
>>                               %}
>>
>>                               That's not a range check just yet; the
>>                               real check, if any, should come after the
>>                               null check, in the form of comparing
>>                               something else with RSI. But you didn't
>>                               show what's after the null check or how
>>                               RSI is used, so it's hard to say what
>>                               you're seeing in your example.
>>
>>                               As for the two test examples, could you
>>                               paste the entire source code, with the
>>                               PrintOptoAssembly output of method1() and
>>                               method2()? The first example looks weird;
>>                               maybe it's a typo, but you're using
>>                               "j < cols" as the loop condition for the
>>                               inner loop.
>>
>>
>>                               - Kris
>>
>>                               On Mon, Jan 16, 2012 at 1:59 AM, Manohar
>>                               Jonnalagedda <manojo10386 at gmail.com>
>>                               wrote:
>>
>>                                   Hello,
>>
>>                                   following this reference on Range
>>                                   Check Elimination done by the Hotspot
>>                                   compiler [1], I was keen to know how
>>                                   I can detect whether range checks are
>>                                   taking place in loops by inspecting
>>                                   output produced with the
>>                                   PrintAssembly flag; with the old
>>                                   PrintOptoAssembly flag, I have seen
>>                                   output such as the following, which I
>>                                   assume to be range checks:
>>
>>                                   B11: #  B73 B12 <- B10  Freq: 1.21365
>>                                   139     movq    RAX, [rsp + #24]  # spill
>>                                   13e     movl    RSI, [RAX + #12 (8-bit)]  # range
>>                                   141     NullCheck RAX
>>
>>                                   What is the equivalent with the new
>>        PrintAssembly flag (using hsdis)?
>>
>>                                   Moreover, as stated on the wiki page
>>                                   [1], loops are optimized if the
>>                                   stride is a compile-time constant. I
>>                                   performed a few tests on a kmeans
>>                                   program, with 3 nested loops, having
>>                                   the following (high-level) structure:
>>
>>                                   ===
>>                                   void method1(){
>>                                   //loop 1
>>                                   for(int i = 0; i< rows1; i++){
>>                                     //...
>>                                      for(int j = 0; j< rows2; j++){
>>                                      //...
>>                                       for(int k = 0; j < cols; k++){
>>        array[j * cols + k] = //...}
>>                                      }
>>                                   }
>>                                   }
>>
>>                                   void method2(){
>>                                   //loop 2
>>                                   for(int i =0; i < rows1; i++){
>>                                      for(int j=0 ; i< rows2; j++){
>>                                        for(int k=0 ; k< cols; k++){
>>                                          array[i*cols+k] = //...
>>                                        }
>>                                      }
>>                                   }
>>                                   }
>>
>>                                   void main(){
>>
>>                                     do{
>>                                       method1(); method2();
>>                                     }while(!converged)
>>
>>                                   }
>>                                   ====
>>
>>                                   In the first test, cols is an int
>>                                   whose value is determined at runtime
>>                                   (by reading a file); in the second
>>                                   test, it is given as a compile-time
>>                                   constant (3). In the second test,
>>                                   there is a *significant* speed-up
>>                                   (around 40%).
>>
>>                                   However, when studying the diff of
>>                                   the output of PrintOptoAssembly for
>>                                   both method1 and method2, there is no
>>                                   difference (apart from slight value
>>                                   changes in frequency). Would you have
>>                                   any hints as to where I could look
>>                                   for differences?
>>
>>                                   Thanks a lot,
>>                                   Manohar
>>
>>                                   [1]
>>                                   https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
>>
>>
>>
>>                       As Kris pointed out, you need to fix your example.
>>
>>
>>
>>
>>        --
>>        Vitaly
>>        617-548-7007 (mobile)
>>
>>
>>
>>
>> --
>> Vitaly
>> 617-548-7007 (mobile)
>>
>