Detecting range check elimination with PrintAssembly
Vitaly Davidovich
vitalyd at gmail.com
Tue Jan 17 10:10:05 PST 2012
Awesome info - thanks very much for taking the time to answer.
Sent from my phone
On Jan 17, 2012 1:02 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com>
wrote:
> Vitaly Davidovich wrote:
>
>> Vladimir, thanks for that -- very useful and interesting to know. So I
>> take it a compilation after initial deopt may be less aggressive than 1st
>> time around simply because the accumulated MD might be in a state such that
>> it guides optimization in a different direction, is that right?
>>
>
> Yes, it is right, especially for inlining dynamic calls based on target
> profiling when after some time an other method is called. We also have
> limit how many times method could be recompiled after which the method will
> be only interpreted. Also we have limit on how many uncommon traps
> (deoptimizations) happened in method after which we stop generating some
> optimizations which use uncommon traps. Usually it means generating slow
> paths in compiled code which we thought (based on original profiling)
> should not be executed (so we generated uncommon trap). As result compiled
> code size will be changed and inlining also. Note, it happens not
> immediately but only after significant number of deoptimizations:
> PerMethodTrapLimit is 100 and PerMethodRecompilationCutoff is 400.
>
>
>> I guess how does one reason about the generated assembly when it's so
>> (possibly) dynamic? :) I've looked at the assembly in a product hotspot via
>> hsdis, but then you realize that you may be looking at just one compilation
>> of it, and it may be different at some point later. Also, for something
>> like Intel's VTune (or Amplifier) that support showing JIT'd assembly, I
>> guess it's the same issue? This is quite different, of course, with .NET
>> CLR's JIT because there's no profiling there -- 1st time method is hit, it
>> gets JIT'd and there's no code pitching (makes looking at the assembly seem
>> a bit more "reassuring").
>>
>
> VTune shows several compiled methods. We have our profiling tool on
> Solaris (Analizer) which do the same. Note that deoptimization and
> recompilation usually happen only during startup. So only latest compiled
> version used in long run.
>
>
>> In general, would you say that the C2 compiler favors "over
>> aggressiveness" with deopt guards? Or does it try to avoid deopts and only
>> perform aggressive optimizations where the profile is quite conclusive on
>> type and/or counters?
>>
>
> C2 does compilations as aggressive as possible. And it will be less
> aggressive only during later recompilations.
>
> Regards,
> Vladimir
>
>
>> Thanks
>>
>> On Mon, Jan 16, 2012 at 11:52 PM, Vladimir Kozlov <
>> vladimir.kozlov at oracle.com <mailto:vladimir.kozlov@**oracle.com<vladimir.kozlov at oracle.com>>>
>> wrote:
>>
>> I need to be more clear. We can't get 100% correct profiling result
>> for counters. First, profiling structure (MethodData) are created
>> only after method is executed for some time (for Server compiler it
>> is 3000 invocations). It is done to avoid profiling short programs
>> since profiling will slowdown Interpreter by about 20%. This delayed
>> profiling introduce first discrepancy between counters. MD could be
>> also created when method has a hot loop (a lot of iterations) and
>> OSR compilation is requested. As result counters in the loop will
>> not correlate with counters before the loop. An other reason for not
>> precise profiling counters is execution of method by multiple
>> threads. We have only one MD structure per method which is updated
>> (not atomically for speed) by all such threads. And when method is
>> prepared for compilation compiler thread creates snapshot of MD
>> which could happened in the middle of method execution by java threads.
>>
>> We don't reset MD during deoptimization because we still need
>> previous values (in addition to counters we collect type
>> information). So counters are accumulated. And profiling is resumed
>> from deoptimization point which again adds discrepancy.
>>
>> I hope it explains why we can't trust 100% to profiling counters.
>>
>> Regards,
>> Vladimir
>>
>>
>> On 1/16/12 4:15 PM, Vitaly Davidovich wrote:
>>
>> Vladimir, thanks for the explanation and the code pointer.
>>
>> Intuitively, it would seem like a good idea to trust the profile
>> 100% if it reports the same value used 100% of the time
>> (I can see how anything less than 100%, even a very high
>> probability of same value, is not trustworthy) given sufficient
>> trips through the loop. Although I can see how an app may have
>> phases where same value is seen for a while before it's
>> switched, but that's where I thought deopt would help. There
>> must be a good chunk of code out there that doesn't know
>> at static compilation time the loop count (so can't use
>> compile-time constant), but at runtime the actual value doesn't
>> change for many many trips through the loop; I know I have code
>> like that in various places.
>>
>> What's the reason a compilation after deopt would not be as
>> aggressive as the 1st time? Is it because the profile
>> information may be "weaker" (i.e. more uncertainty in it)? I
>> thought the profile is completely reset after deopt, so I
>> would think if the loop is now executed with a different
>> "constant" value (e.g. in our example, instead of 3 it's now
>> 4), then the same optimizations will be applied (of course if
>> unrolling the loop is no longer advantageous due to a much
>> different value, I can see how different optimizations will be
>> applied).
>>
>> Thanks
>>
>> On Mon, Jan 16, 2012 at 6:45 PM, Vladimir Kozlov
>> <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov@**oracle.com<vladimir.kozlov at oracle.com>
>> >
>> <mailto:vladimir.kozlov at __orac**le.com <http://oracle.com>
>> <mailto:vladimir.kozlov@**oracle.com <vladimir.kozlov at oracle.com>>>>
>> wrote:
>>
>> Vitaly,
>>
>> We do use profile_trip_cnt during loop unroll calculation but
>> not during fully unroll because we can't trust it 100%
>> since program's phase and number of iterations could change
>> after method is compiled. See policy_unroll() and
>> policy_maximally_unroll():
>>
>> http://hg.openjdk.java.net/___**
>> _hsx/hotspot-comp/hotspot/**file/____89d0a5d40008/src/**
>> share/vm/____opto/**loopTransform.cpp<http://hg.openjdk.java.net/____hsx/hotspot-comp/hotspot/file/____89d0a5d40008/src/share/vm/____opto/loopTransform.cpp>
>> <http://hg.openjdk.java.net/__**hsx/hotspot-comp/hotspot/file/**
>> __89d0a5d40008/src/share/vm/__**opto/loopTransform.cpp<http://hg.openjdk.java.net/__hsx/hotspot-comp/hotspot/file/__89d0a5d40008/src/share/vm/__opto/loopTransform.cpp>
>> >
>>
>> <http://hg.openjdk.java.net/__**
>> hsx/hotspot-comp/hotspot/file/**__89d0a5d40008/src/share/vm/__**
>> opto/loopTransform.cpp<http://hg.openjdk.java.net/__hsx/hotspot-comp/hotspot/file/__89d0a5d40008/src/share/vm/__opto/loopTransform.cpp>
>> <http://hg.openjdk.java.net/**hsx/hotspot-comp/hotspot/file/**
>> 89d0a5d40008/src/share/vm/**opto/loopTransform.cpp<http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp>
>> >>
>>
>> We could use deopt as you suggested but deoptimization is
>> double-edge sword, when method recompiled after
>> deoptimization some aggressive optimizations will not be
>> executed for it so the new generated code could be slower.
>>
>> Regards,
>> Vladimir
>>
>>
>> On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
>>
>> Hi Vladimir,
>>
>> If x_col is always seen to be same value in the profile
>> shouldn't the loop be unrolled as well with some deopt
>> guard? Or
>> does this not participate in profiling?
>>
>> Thanks
>>
>> On Jan 16, 2012 4:57 PM, "Vladimir Kozlov"
>> <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov@**oracle.com<vladimir.kozlov at oracle.com>
>> >
>> <mailto:vladimir.kozlov at __orac**le.com <http://oracle.com>
>> <mailto:vladimir.kozlov@**oracle.com <vladimir.kozlov at oracle.com>
>> >>
>> <mailto:vladimir.kozlov@
>> <mailto:vladimir.kozlov@>__ora**c__le.com <http://orac__le.com> <
>> http://oracle.com>
>> <mailto:vladimir.kozlov at __orac**le.com <http://oracle.com>
>> <mailto:vladimir.kozlov@**oracle.com <vladimir.kozlov at oracle.com>>>>>
>> wrote:
>>
>> > be different, because the expressions are similar. The
>> difference in runtime (due to cols being a compile-time
>> > constant) will be visible elsewhere. Is that right? If
>> so, where would I be able to detect this?
>>
>> In such situations we usually use some visual tools
>> to see difference between log outputs. At least you can use
>> 'diff'. You may need to replace instructions addresses in
>> outputs (number at the beginning of lines) with the same
>> value to match. There are few tricks you may use to
>> get similar PrintOptoAssembly output. Use next flags to
>> avoid
>> mixing output from program output and from 2 compiler
>> threads (flags stop program until a method is compiled
>> and run
>> only one compiler thread):
>>
>> -Xbatch -XX:CICompilerCount=1
>>
>> Also add -XX:+PrintCompilation -XX:+PrintInlining to
>> see what method is compiled and inlined. Note that you
>> may see
>> similar output for individual methods but could be
>> big difference in compiled caller (computeAll()) method
>> where 2
>> loop methods could be inlined. So you need to compare
>> all compiled methods.
>>
>> In general, to have constant as loop limit is always
>> win because some checks in generated code could be
>> avoided and
>> more optimizations could be done for such loops. Use
>> -XX:+TraceLoopOpts to see what loop optimizations are
>> done in
>> both cases.
>>
>> For example, in your code you set 'x_col = 3', as
>> result the next loop in
>> constructNearestClusterVector(**______) will be
>>
>>
>> fully unrolled when this method is inlined into
>> computeAll() and x_col is replaced with '3':
>>
>> for(k = 0; k < x_col; k++) {
>> double tmp = x[i*x_col + k] - mu[j* mu_col
>> + k];
>> dist += tmp * tmp;
>> }
>>
>> Vladimir
>>
>> On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>>
>> Hi Kris, Vladimir,
>>
>> thanks for both your responses.
>>
>> Second, your two test methods are different
>> so you can't directly compare them. method1() iterates
>> over rows
>> using
>> middle loop index 'j' and method2() uses
>> external loop index 'i'. Unless they are typos again.
>>
>>
>> You are right, these are indeed typos. As Kris
>> suggested, I have the code printed here:
>> http://pastebin.com/xRFD1Nt1.
>> The methods corresponding to method1, and method2
>> are constructNearestClusterVector and
>> computeNewCentroids. Their
>> PrintOptoAssembly outputs are respectively at
>> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>>
>> Also, it seems I have not explained myself
>> correctly. I am not trying to compare the performance of
>> method1 with
>> respect
>> to that of method2: method1 and method2 both run
>> in the same program. What I am trying to compare is their
>> performance
>> in two cases:
>> - when cols is a compile-time constant (much faster)
>> - when cols is a value determined at run-time
>>
>> If you are using jdk7 there are few flags you
>> can use to print loop optimizations information. They
>> need debug
>> version of VM but it is not problem for you,
>> I think, since you can use debug PrintOptoAssembly flag.
>>
>> -XX:+TraceLoopOpts prints all happened loop
>> optimizations and loop tree after each round of loop opts,
>> -XX:+TraceLoopPredicate prints RC information
>> when it is moved from a loop,
>> -XX:+TraceRangeLimitCheck prints additional
>> information for RC elimination optimization.
>>
>>
>> Thanks for these, I will have a look at what they
>> output.
>>
>> Fourth, range check expression in your
>> example is not what you think. RC expression should be next:
>> (i*stride+offset) where 'i' is loop variable,
>> 'stride' is constant and 'offset' is loop invariant.
>>
>> In your example 'offset' is (j * cols) since
>> it is loop invariant, 'k' is loop variable and stride
>> is '1' (one).
>> In both your methods RC will be moved out of
>> inner loop so the code for it will be the same. The only
>> difference in
>> these methods will be where and how (j *
>> cols) and (i * cols) expressions are calculated.
>>
>>
>> I'd guess it's the difference in locality
>> that made the difference in performance in your two tests.
>>
>> Thanks for the explanation. I understand from
>> the above that the assembly output in both cases
>> mentioned above
>> may not
>> be different, because the expressions are
>> similar. The difference in runtime (due to cols being a
>> compile-time
>> constant)
>> will be visible elsewhere. Is that right? If so,
>> where would I be able to detect this?
>>
>> Cheers,
>> Manohar
>>
>> In your PrintOptoAssembly output snippet,
>> the instruction at 0x13e is a LoadRange, which loads
>> the range
>> from
>> the header
>> of an array:
>>
>> (from x86_64.ad <http://x86_64.ad>
>> <http://x86_64.ad> <http://x86_64.ad> <http://x86_64.ad>
>> <http://x86_64.ad>)
>>
>>
>>
>> // Load Range
>> instruct loadRange(rRegI dst, memory mem)
>> %{
>> match(Set dst (LoadRange mem));
>>
>> ins_cost(125); // XXX
>> format %{ "movl $dst, $mem\t# range"
>> %}
>> opcode(0x8B);
>> ins_encode(REX_reg_mem(dst, mem),
>> OpcP, reg_mem(dst, mem));
>> ins_pipe(ialu_reg_mem);
>> %}
>>
>> That's not a range check just yet; the
>> real check, if any, should come after the null check, in
>> the form of
>> comparing
>> something else with RSI. But you didn't
>> show what's after the null check, how RSI is used, so
>> it's hard
>> to say what
>> you're seeing in your example.
>>
>> As for the two test examples, could you
>> paste the entire source code, with the PrintOptoAssembly
>> output of
>> method1() and
>> method2() ? The first example looks
>> weird, maybe it's a typo but you're using "j < cols" as the loop
>> condition
>> for the
>> inner loop.
>>
>>
>> - Kris
>>
>> On Mon, Jan 16, 2012 at 1:59 AM, Manohar
>> Jonnalagedda <manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>
>> <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com
>> >**>
>> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**>__> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>
>> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**>__>__>
>> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>
>> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**>__> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**>
>> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**>__>__>__>> wrote:
>>
>> Hello,
>>
>> following this reference on Range
>> Check Elimination done by the Hotspot compiler [1], I was
>> keen in
>> knowing
>> how I
>> can detect whether range checks are
>> taking place in loops by inspecting output using the
>> PrintAssembly flag;
>> with
>> the old PrintOptoAssembly flag, I
>> have seen output such as the following, which I assume to
>> be range
>> checks :
>>
>> B11: # B73 B12 <- B10 Freq: 1.21365
>> 139 movq RAX, [rsp + #24] #
>> spill
>> 13e movl RSI, [RAX + #12
>> (8-bit)] # range
>> 141 NullCheck RAX
>>
>> What is the equivalent with the new
>> PrintAssembly flag (using hsdis)?
>>
>> Moreover, as stated on the wiki page
>> [1], loops are optimized if the stride is a compile-time
>> constant. I
>> performed
>> a few tests on a kmeans program, with
>> 3 nested loops, having the following (high-level)
>> structure:
>>
>> ===
>> void method1(){
>> //loop 1
>> for(int i = 0; i< rows1; i++){
>> //...
>> for(int j = 0; j< rows2; j++){
>> //...
>> for(int k = 0; j < cols; k++){
>> array[j * cols + k] = //...}
>> }
>> }
>> }
>>
>> void method2(){
>> //loop 2
>> for(int i =0; i < rows1; i++){
>> for(int j=0 ; i< rows2; j++){
>> for(int k=0 ; k< cols; k++){
>> array[i*cols+k] = //...
>> }
>> }
>> }
>> }
>>
>> void main(){
>>
>> do{
>> method1(); method2();
>> }while(!converged)
>>
>> }
>> ====
>>
>> In the first test, cols is an int
>> whose value is determined at runtime (by reading a file),
>> in the
>> second
>> test, it
>> is given as a compile-time
>> constant(3). In the second test, there is a */significant*/
>> speed-up
>> (around 40%).
>>
>> However, when studying the diff of
>> the output of PrintOptoAssembly for both method1 and method2,
>> there is no
>> difference (apart from slight value
>> changes in frequency). Would you have any hints as to
>> where I
>> could look for
>> differences?
>>
>> Thanks a lot,
>> Manohar
>>
>> [1]
>> https://wikis.oracle.com/_____**_display/HotSpotInternals/____**
>> __RangeCheckElimination<https://wikis.oracle.com/______display/HotSpotInternals/______RangeCheckElimination>
>> <https://wikis.oracle.com/____**display/HotSpotInternals/____**
>> RangeCheckElimination<https://wikis.oracle.com/____display/HotSpotInternals/____RangeCheckElimination>
>> >
>> <https://wikis.oracle.com/____**
>> display/HotSpotInternals/____**RangeCheckElimination<https://wikis.oracle.com/____display/HotSpotInternals/____RangeCheckElimination>
>> <https://wikis.oracle.com/__**display/HotSpotInternals/__**
>> RangeCheckElimination<https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination>
>> >>
>>
>>
>> <https://wikis.oracle.com/____**
>> display/HotSpotInternals/____**RangeCheckElimination<https://wikis.oracle.com/____display/HotSpotInternals/____RangeCheckElimination>
>> <https://wikis.oracle.com/__**display/HotSpotInternals/__**
>> RangeCheckElimination<https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination>
>> >
>> <https://wikis.oracle.com/__**
>> display/HotSpotInternals/__**RangeCheckElimination<https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination>
>> <https://wikis.oracle.com/**display/HotSpotInternals/**
>> RangeCheckElimination<https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination>
>> >>>
>>
>>
>>
>> As Kris pointed you need to fix your example:
>>
>>
>>
>>
>> --
>> Vitaly
>> 617-548-7007 <tel:617-548-7007> (mobile)
>>
>>
>>
>>
>> --
>> Vitaly
>> 617-548-7007 (mobile)
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20120117/033303e0/attachment-0001.html
More information about the hotspot-dev
mailing list