Detecting range check elimination with PrintAssembly
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Jan 16 15:45:14 PST 2012
Vitaly,
We do use profile_trip_cnt during loop unroll calculation but not during fully unroll because we can't trust it 100%
since program's phase and number of iterations could change after method is compiled. See policy_unroll() and
policy_maximally_unroll():
http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp
We could use deopt as you suggested but deoptimization is double-edge sword, when method recompiled after deoptimization
some aggressive optimizations will not be executed for it so the new generated code could be slower.
Regards,
Vladimir
On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
> Hi Vladimir,
>
> If x_col is always seen to be same value in the profile shouldn't the loop be unrolled as well with some deopt guard? Or
> does this not participate in profiling?
>
> Thanks
>
> On Jan 16, 2012 4:57 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com <mailto:vladimir.kozlov at oracle.com>> wrote:
>
> > be different, because the expressions are similar. The difference in runtime (due to cols being a compile-time
> > constant) will be visible elsewhere. Is that right? If so, where would I be able to detect this?
>
> In such situations we usually use some visual tools to see difference between log outputs. At least you can use
> 'diff'. You may need to replace instructions addresses in outputs (number at the beginning of lines) with the same
> value to match. There are few tricks you may use to get similar PrintOptoAssembly output. Use next flags to avoid
> mixing output from program output and from 2 compiler threads (flags stop program until a method is compiled and run
> only one compiler thread):
>
> -Xbatch -XX:CICompilerCount=1
>
> Also add -XX:+PrintCompilation -XX:+PrintInlining to see what method is compiled and inlined. Note that you may see
> similar output for individual methods but could be big difference in compiled caller (computeAll()) method where 2
> loop methods could be inlined. So you need to compare all compiled methods.
>
> In general, to have constant as loop limit is always win because some checks in generated code could be avoided and
> more optimizations could be done for such loops. Use -XX:+TraceLoopOpts to see what loop optimizations are done in
> both cases.
>
> For example, in your code you set 'x_col = 3', as result the next loop in constructNearestClusterVector(__) will be
> fully unrolled when this method is inlined into computeAll() and x_col is replaced with '3':
>
> for(k = 0; k < x_col; k++) {
> double tmp = x[i*x_col + k] - mu[j* mu_col + k];
> dist += tmp * tmp;
> }
>
> Vladimir
>
> On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>
> Hi Kris, Vladimir,
>
> thanks for both your responses.
>
> Second, your two test methods are different so you can't directly compare them. method1() iterates over rows
> using
> middle loop index 'j' and method2() uses external loop index 'i'. Unless they are typos again.
>
>
> You are right, these are indeed typos. As Kris suggested, I have the code printed here:
> http://pastebin.com/xRFD1Nt1.
> The methods corresponding to method1, and method2 are constructNearestClusterVector and computeNewCentroids. Their
> PrintOptoAssembly outputs are respectively at http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>
> Also, it seems I have not explained myself correctly. I am not trying to compare the performance of method1 with
> respect
> to that of method2: method1 and method2 both run in the same program. What I am trying to compare is their
> performance
> in two cases:
> - when cols is a compile-time constant (much faster)
> - when cols is a value determined at run-time
>
> If you are using jdk7 there are few flags you can use to print loop optimizations information. They need debug
> version of VM but it is not problem for you, I think, since you can use debug PrintOptoAssembly flag.
>
> -XX:+TraceLoopOpts prints all happened loop optimizations and loop tree after each round of loop opts,
> -XX:+TraceLoopPredicate prints RC information when it is moved from a loop,
> -XX:+TraceRangeLimitCheck prints additional information for RC elimination optimization.
>
>
> Thanks for these, I will have a look at what they output.
>
> Fourth, range check expression in your example is not what you think. RC expression should be next:
> (i*stride+offset) where 'i' is loop variable, 'stride' is constant and 'offset' is loop invariant.
>
> In your example 'offset' is (j * cols) since it is loop invariant, 'k' is loop variable and stride is '1' (one).
> In both your methods RC will be moved out of inner loop so the code for it will be the same. The only
> difference in
> these methods will be where and how (j * cols) and (i * cols) expressions are calculated.
>
>
> I'd guess it's the difference in locality that made the difference in performance in your two tests.
>
> Thanks for the explanation. I understand from the above that the assembly output in both cases mentioned above
> may not
> be different, because the expressions are similar. The difference in runtime (due to cols being a compile-time
> constant)
> will be visible elsewhere. Is that right? If so, where would I be able to detect this?
>
> Cheers,
> Manohar
>
> In your PrintOptoAssembly output snippet, the instruction at 0x13e is a LoadRange, which loads the range
> from
> the header
> of an array:
>
> (from x86_64.ad <http://x86_64.ad> <http://x86_64.ad> <http://x86_64.ad>)
>
> // Load Range
> instruct loadRange(rRegI dst, memory mem)
> %{
> match(Set dst (LoadRange mem));
>
> ins_cost(125); // XXX
> format %{ "movl $dst, $mem\t# range" %}
> opcode(0x8B);
> ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst, mem));
> ins_pipe(ialu_reg_mem);
> %}
>
> That's not a range check just yet; the real check, if any, should come after the null check, in the form of
> comparing
> something else with RSI. But you didn't show what's after the null check, how RSI is used, so it's hard
> to say what
> you're seeing in your example.
>
> As for the two test examples, could you paste the entire source code, with the PrintOptoAssembly output of
> method1() and
> method2() ? The first example looks weird, maybe it's a typo but you're using "j < cols" as the loop
> condition
> for the
> inner loop.
>
>
> - Kris
>
> On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda <manojo10386 at gmail.com
> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>>
> <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com
> <mailto:manojo10386 at gmail.com>>__>> wrote:
>
> Hello,
>
> following this reference on Range Check Elimination done by the Hotspot compiler [1], I was keen in
> knowing
> how I
> can detect whether range checks are taking place in loops by inspecting output using the
> PrintAssembly flag;
> with
> the old PrintOptoAssembly flag, I have seen output such as the following, which I assume to be range
> checks :
>
> B11: # B73 B12 <- B10 Freq: 1.21365
> 139 movq RAX, [rsp + #24] # spill
> 13e movl RSI, [RAX + #12 (8-bit)] # range
> 141 NullCheck RAX
>
> What is the equivalent with the new PrintAssembly flag (using hsdis)?
>
> Moreover, as stated on the wiki page [1], loops are optimized if the stride is a compile-time
> constant. I
> performed
> a few tests on a kmeans program, with 3 nested loops, having the following (high-level) structure:
>
> ===
> void method1(){
> //loop 1
> for(int i = 0; i< rows1; i++){
> //...
> for(int j = 0; j< rows2; j++){
> //...
> for(int k = 0; j < cols; k++){ array[j * cols + k] = //...}
> }
> }
> }
>
> void method2(){
> //loop 2
> for(int i =0; i < rows1; i++){
> for(int j=0 ; i< rows2; j++){
> for(int k=0 ; k< cols; k++){
> array[i*cols+k] = //...
> }
> }
> }
> }
>
> void main(){
>
> do{
> method1(); method2();
> }while(!converged)
>
> }
> ====
>
> In the first test, cols is an int whose value is determined at runtime (by reading a file), in the
> second
> test, it
> is given as a compile-time constant(3). In the second test, there is a */significant*/ speed-up
> (around 40%).
>
> However, when studying the diff of the output of PrintOptoAssembly for both method1 and method2,
> there is no
> difference (apart from slight value changes in frequency). Would you have any hints as to where I
> could look for
> differences?
>
> Thanks a lot,
> Manohar
>
> [1] https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination
> <https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination>
>
>
>
> As Kris pointed you need to fix your example:
>
More information about the hotspot-dev
mailing list