Detecting range check elimination with PrintAssembly
Vladimir Kozlov
vladimir.kozlov at oracle.com
Tue Jan 17 10:02:19 PST 2012
Vitaly Davidovich wrote:
> Vladimir, thanks for that -- very useful and interesting to know. So I
> take it a compilation after initial deopt may be less aggressive than
> 1st time around simply because the accumulated MD might be in a state
> such that it guides optimization in a different direction, is that right?
Yes, that is right, especially for the inlining of dynamic calls based on
receiver profiling, when after some time a different target method starts
being called. We also have a limit on how many times a method can be
recompiled; after that the method is only interpreted. We also have a limit
on how many uncommon traps (deoptimizations) may happen in a method; after
that we stop applying some optimizations which rely on uncommon traps.
Usually that means generating slow paths in the compiled code for cases
which, based on the original profile, we thought would never be executed
(and so had emitted an uncommon trap instead). As a result the compiled
code size changes, and so do the inlining decisions. Note that this happens
not immediately but only after a significant number of deoptimizations:
PerMethodTrapLimit is 100 and PerMethodRecompilationCutoff is 400.
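
As a minimal, hypothetical sketch of the situation described above (the
class and method names are invented for illustration, not taken from this
thread): while the profile only ever records one receiver type at a call
site, C2 can inline that target and guard it with an uncommon trap; once a
second receiver type shows up, the compiled code deoptimizes and the
recompiled version has to treat that call site more conservatively.

    // Hypothetical example. Run with -XX:+PrintCompilation (and
    // -XX:+PrintInlining on a debug VM) to watch compute() being
    // compiled, made not entrant after the trap, and recompiled.
    interface Shape { double area(); }

    final class Square implements Shape {
        public double area() { return 4.0; }
    }

    final class Circle implements Shape {
        public double area() { return Math.PI; }
    }

    public class DeoptDemo {
        static double compute(Shape s) {
            // While only Square is ever seen here, the virtual call can
            // be inlined and guarded by an uncommon trap.
            return s.area() * 2.0;
        }

        public static void main(String[] args) {
            Shape square = new Square();
            Shape circle = new Circle();
            double sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += compute(square);   // monomorphic phase
            }
            for (int i = 0; i < 1_000_000; i++) {
                sum += compute(circle);   // new receiver type: trap, recompile
            }
            System.out.println(sum);
        }
    }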
>
> I guess how does one reason about the generated assembly when it's so
> (possibly) dynamic? :) I've looked at the assembly in a product hotspot
> via hsdis, but then you realize that you may be looking at just one
> compilation of it, and it may be different at some point later. Also,
> for something like Intel's VTune (or Amplifier) that supports showing
> JIT'd assembly, I guess it's the same issue? This is quite different, of
> course, with .NET CLR's JIT because there's no profiling there -- 1st
> time method is hit, it gets JIT'd and there's no code pitching (makes
> looking at the assembly seem a bit more "reassuring").
VTune shows the several compiled versions of a method. We have our own
profiling tool on Solaris (Analyzer) which does the same. Note that
deoptimization and recompilation usually happen only during startup, so in
the long run only the latest compiled version is used.
>
> In general, would you say that the C2 compiler favors "over
> aggressiveness" with deopt guards? Or does it try to avoid deopts and
> only perform aggressive optimizations where the profile is quite
> conclusive on type and/or counters?
C2 compiles as aggressively as possible. It becomes less aggressive only in
later recompilations.
Regards,
Vladimir
>
> Thanks
>
> On Mon, Jan 16, 2012 at 11:52 PM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>
> I need to be clearer. We can't get 100% correct profiling results for
> counters. First, the profiling structure (MethodData) is created only
> after a method has executed for some time (for the Server compiler that
> is 3000 invocations). This is done to avoid profiling short programs,
> since profiling slows down the Interpreter by about 20%. This delayed
> profiling introduces the first discrepancy between counters. The MD can
> also be created when a method has a hot loop (a lot of iterations) and
> an OSR compilation is requested; as a result the counters in the loop
> will not correlate with the counters before the loop. Another reason the
> profiling counters are imprecise is execution of the method by multiple
> threads. We have only one MD structure per method, which is updated (not
> atomically, for speed) by all such threads. And when the method is
> prepared for compilation, the compiler thread creates a snapshot of the
> MD, which can happen in the middle of the method's execution by Java
> threads.
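
As a small, hypothetical illustration of the OSR point (not taken from the
thread): a method that is invoked only once but loops many millions of
times gets its MethodData created because of the hot loop back-branch, so
the counters inside the loop bear little relation to the counters before
it. Running something like the following with -XX:+PrintCompilation should
show an OSR compilation of sumTo() (OSR compilations are marked with '%'
in that output).

    // Hypothetical example: one invocation, a very hot loop.
    public class OsrDemo {
        static long sumTo(int n) {
            long sum = 0;
            // The back-branch counter of this loop becomes hot long before
            // the invocation counter of sumTo() does, so an OSR compilation
            // is requested while the single invocation is still running.
            for (int i = 0; i < n; i++) {
                sum += i;
            }
            return sum;
        }

        public static void main(String[] args) {
            System.out.println(sumTo(50_000_000));
        }
    }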
>
> We don't reset the MD during deoptimization because we still need the
> previous values (in addition to counters we collect type information).
> So the counters keep accumulating, and profiling resumes from the
> deoptimization point, which again adds discrepancy.
>
> I hope this explains why we can't trust the profiling counters 100%.
>
> Regards,
> Vladimir
>
>
> On 1/16/12 4:15 PM, Vitaly Davidovich wrote:
>
> Vladimir, thanks for the explanation and the code pointer.
>
> Intuitively, it would seem like a good idea to trust the profile
> 100% if it reports the same value used 100% of the time
> (I can see how anything less than 100%, even a very high
> probability of same value, is not trustworthy) given sufficient
> trips through the loop. Although I can see how an app may have
> phases where same value is seen for a while before it's
> switched, but that's where I thought deopt would help. There
> must be a good chunk of code out there that doesn't know
> at static compilation time the loop count (so can't use
> compile-time constant), but at runtime the actual value doesn't
> change for many many trips through the loop; I know I have code
> like that in various places.
>
> What's the reason a compilation after deopt would not be as
> aggressive as the 1st time? Is it because the profile
> information may be "weaker" (i.e. more uncertainty in it)? I
> thought the profile is completely reset after deopt, so I
> would think if the loop is now executed with a different
> "constant" value (e.g. in our example, instead of 3 it's now
> 4), then the same optimizations will be applied (of course if
> unrolling the loop is no longer advantageous due to a much
> different value, I can see how different optimizations will be
> applied).
>
> Thanks
>
> On Mon, Jan 16, 2012 at 6:45 PM, Vladimir Kozlov
> <vladimir.kozlov at oracle.com> wrote:
>
> Vitaly,
>
> We do use profile_trip_cnt during the loop unroll calculation, but not
> when deciding to fully unroll, because we can't trust it 100%: the
> program's phase and the number of iterations can change after the
> method is compiled. See policy_unroll() and policy_maximally_unroll():
>
>
> http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp
>
> We could use deopt as you suggested, but deoptimization is a
> double-edged sword: when a method is recompiled after deoptimization,
> some aggressive optimizations are not applied to it, so the newly
> generated code could be slower.
>
> Regards,
> Vladimir
>
>
> On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
>
> Hi Vladimir,
>
> If x_col is always seen to be same value in the profile
> shouldn't the loop be unrolled as well with some deopt
> guard? Or
> does this not participate in profiling?
>
> Thanks
>
> On Jan 16, 2012 4:57 PM, "Vladimir Kozlov"
> <vladimir.kozlov at oracle.com> wrote:
>
> > be different, because the expressions are similar. The difference
> > in runtime (due to cols being a compile-time constant) will be
> > visible elsewhere. Is that right? If so, where would I be able to
> > detect this?
>
> In such situations we usually use some visual tool to see the
> difference between log outputs; at a minimum you can use 'diff'. You
> may need to replace the instruction addresses in the outputs (the
> number at the beginning of each line) with the same value so that they
> match. There are a few tricks you can use to get comparable
> PrintOptoAssembly output. Use the following flags to avoid mixing
> program output with output from two compiler threads (they stop the
> program until a method is compiled and run only one compiler thread):
>
> -Xbatch -XX:CICompilerCount=1
>
> Also add -XX:+PrintCompilation -XX:+PrintInlining to see which methods
> are compiled and inlined. Note that you may see similar output for the
> individual methods but a big difference in the compiled caller
> (computeAll()), where the two loop methods could be inlined. So you
> need to compare all compiled methods.
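
Putting those flags together, an invocation along these lines keeps the two
runs easy to diff (the main class name Kmeans is only a placeholder, and
PrintOptoAssembly needs the debug VM mentioned elsewhere in the thread):

    java -Xbatch -XX:CICompilerCount=1 -XX:+PrintCompilation -XX:+PrintInlining -XX:+PrintOptoAssembly Kmeans > compile.log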
>
> In general, having a constant as the loop limit is always a win,
> because some checks in the generated code can be avoided and more
> optimizations can be applied to such loops. Use -XX:+TraceLoopOpts to
> see which loop optimizations are done in both cases.
>
> For example, in your code you set 'x_col = 3'; as a result, the
> following loop in constructNearestClusterVector() will be fully
> unrolled when this method is inlined into computeAll() and x_col is
> replaced with '3':
>
> for (k = 0; k < x_col; k++) {
>     double tmp = x[i*x_col + k] - mu[j*mu_col + k];
>     dist += tmp * tmp;
> }
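
For reference, a rough sketch of what full unrolling with the constant trip
count 3 amounts to (this is only the conceptual shape; the actual code is
produced by C2, not written by hand): the loop body is repeated once per
iteration and the loop-control test disappears.

    // Conceptual result of fully unrolling the loop above when x_col == 3.
    double tmp0 = x[i*3 + 0] - mu[j*mu_col + 0];
    dist += tmp0 * tmp0;
    double tmp1 = x[i*3 + 1] - mu[j*mu_col + 1];
    dist += tmp1 * tmp1;
    double tmp2 = x[i*3 + 2] - mu[j*mu_col + 2];
    dist += tmp2 * tmp2;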
>
> Vladimir
>
> On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>
> Hi Kris, Vladimir,
>
> thanks for both your responses.
>
> Second, your two test methods are different, so you can't directly
> compare them: method1() indexes the rows using the middle loop index
> 'j' and method2() uses the outer loop index 'i'. Unless those are
> typos again.
>
>
> You are right, these are indeed typos. As Kris suggested, I have
> pasted the code here: http://pastebin.com/xRFD1Nt1.
> The methods corresponding to method1 and method2
> are constructNearestClusterVector and
> computeNewCentroids. Their
> PrintOptoAssembly outputs are respectively at
> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>
> Also, it seems I have not explained myself
> correctly. I am not trying to compare the performance of
> method1 with
> respect
> to that of method2: method1 and method2 both run
> in the same program. What I am trying to compare is their
> performance
> in two cases:
> - when cols is a compile-time constant (much faster)
> - when cols is a value determined at run-time
>
> If you are using JDK 7, there are a few flags you can use to print
> loop optimization information. They need a debug version of the VM,
> but that is not a problem for you, I think, since you can already use
> the debug PrintOptoAssembly flag.
>
> -XX:+TraceLoopOpts prints all loop optimizations performed and the
> loop tree after each round of loop opts,
> -XX:+TraceLoopPredicate prints range check (RC) information when a
> check is moved out of a loop,
> -XX:+TraceRangeLimitCheck prints additional information for the RC
> elimination optimization.
>
>
> Thanks for these, I will have a look at what they
> output.
>
> Fourth, the range check expression in your example is not what you
> think. An RC expression has the form (i*stride + offset), where 'i'
> is the loop variable, 'stride' is a constant and 'offset' is loop
> invariant.
>
> In your example 'offset' is (j * cols), since it is loop invariant,
> 'k' is the loop variable and the stride is '1' (one). In both of your
> methods the RC will be moved out of the inner loop, so the code for
> it will be the same. The only difference between these methods will
> be where and how the (j * cols) and (i * cols) expressions are
> calculated.
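
A hand-written sketch of that decomposition for the inner loop of the
example (just to make the roles explicit; the compiler performs this
itself, nobody needs to write it this way):

    // Mirrors array[j*cols + k] = ... from the example.
    // Index = k*stride + offset, with stride = 1 and offset = j*cols.
    static void fillRow(double[] array, int j, int cols) {
        int offset = j * cols;          // loop invariant in k
        for (int k = 0; k < cols; k++) {
            // The range check for this access depends only on offset, cols
            // and array.length, so it can be moved out of the loop.
            array[offset + k] = 0.0;
        }
    }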
>
>
> I'd guess it's the difference in locality
> that made the difference in performance in your two tests.
>
> Thanks for the explanation. I understand from
> the above that the assembly output in both cases
> mentioned above
> may not
> be different, because the expressions are
> similar. The difference in runtime (due to cols being a
> compile-time
> constant)
> will be visible elsewhere. Is that right? If so,
> where would I be able to detect this?
>
> Cheers,
> Manohar
>
> In your PrintOptoAssembly output snippet,
> the instruction at 0x13e is a LoadRange, which loads
> the range
> from
> the header
> of an array:
>
> (from x86_64.ad)
>
>
>
> // Load Range
> instruct loadRange(rRegI dst, memory mem)
> %{
> match(Set dst (LoadRange mem));
>
> ins_cost(125); // XXX
> format %{ "movl $dst, $mem\t# range" %}
> opcode(0x8B);
> ins_encode(REX_reg_mem(dst, mem),
> OpcP, reg_mem(dst, mem));
> ins_pipe(ialu_reg_mem);
> %}
>
> That's not a range check just yet; the
> real check, if any, should come after the null check, in
> the form of
> comparing
> something else with RSI. But you didn't
> show what's after the null check, how RSI is used, so
> it's hard
> to say what
> you're seeing in your example.
>
> As for the two test examples, could you
> paste the entire source code, with the PrintOptoAssembly
> output of
> method1() and
> method2() ? The first example looks
> weird, maybe it's a typo but you're using "j < cols" as the loop
> condition
> for the
> inner loop.
>
>
> - Kris
>
> On Mon, Jan 16, 2012 at 1:59 AM, Manohar
> Jonnalagedda <manojo10386 at gmail.com> wrote:
>
> Hello,
>
> following this reference on Range Check Elimination done by the
> HotSpot compiler [1], I was keen to know how I can detect whether
> range checks are taking place in loops by inspecting the output
> produced with the PrintAssembly flag; with the old PrintOptoAssembly
> flag, I have seen output such as the following, which I assume to be
> range checks:
>
> B11: # B73 B12 <- B10 Freq: 1.21365
> 139 movq RAX, [rsp + #24] # spill
> 13e movl RSI, [RAX + #12
> (8-bit)] # range
> 141 NullCheck RAX
>
> What is the equivalent with the new
> PrintAssembly flag (using hsdis)?
>
> Moreover, as stated on the wiki page
> [1], loops are optimized if the stride is a compile-time
> constant. I
> performed
> a few tests on a kmeans program, with
> 3 nested loops, having the following (high-level)
> structure:
>
> ===
> void method1(){
> //loop 1
> for(int i = 0; i< rows1; i++){
> //...
> for(int j = 0; j< rows2; j++){
> //...
> for(int k = 0; j < cols; k++){
> array[j * cols + k] = //...}
> }
> }
> }
>
> void method2(){
> //loop 2
> for(int i =0; i < rows1; i++){
> for(int j=0 ; i< rows2; j++){
> for(int k=0 ; k< cols; k++){
> array[i*cols+k] = //...
> }
> }
> }
> }
>
> void main(){
>
> do{
> method1(); method2();
> }while(!converged)
>
> }
> ====
>
> In the first test, cols is an int whose value is determined at
> runtime (by reading a file); in the second test, it is given as a
> compile-time constant (3). In the second test there is a
> *significant* speed-up (around 40%).
>
> However, when studying the diff of
> the output of PrintOptoAssembly for both method1 and method2,
> there is no
> difference (apart from slight value
> changes in frequency). Would you have any hints as to
> where I
> could look for
> differences?
>
> Thanks a lot,
> Manohar
>
> [1]
> https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
>
>
>
> As Kris pointed out, you need to fix your example:
>
>
>
>
> --
> Vitaly
> 617-548-7007 (mobile)
>
>
>
>
> --
> Vitaly
> 617-548-7007 (mobile)