Detecting range check elimination with PrintAssembly
Vitaly Davidovich
vitalyd at gmail.com
Mon Jan 16 16:15:18 PST 2012
Vladimir, thanks for the explanation and the code pointer.
Intuitively, it would seem like a good idea to trust the profile 100% if it
reports the same value used 100% of the time (I can see how anything less
than 100%, even a very high probability of same value, is not trustworthy)
given sufficient trips through the loop. Although I can see how an app may
have phases where same value is seen for a while before it's switched, but
that's where I thought deopt would help. There must be a good chunk of
code out there that doesn't know at static compilation time the loop count
(so can't use compile-time constant), but at runtime the actual value
doesn't change for many many trips through the loop; I know I have code
like that in various places.
What's the reason a compilation after deopt would not be as aggressive as
the 1st time? Is it because the profile information may be "weaker" (i.e.
more uncertainty in it)? I thought the profile is completely reset after
deopt, so I would think if the loop is now executed with a different
"constant" value (e.g. in our example, instead of 3 it's now 4), then the
same optimizations will be applied (of course if unrolling the loop is no
longer advantageous due to a much different value, I can see how different
optimizations will be applied).
Thanks
On Mon, Jan 16, 2012 at 6:45 PM, Vladimir Kozlov <vladimir.kozlov at oracle.com
> wrote:
> Vitaly,
>
> We do use profile_trip_cnt during loop unroll calculation but not during
> fully unroll because we can't trust it 100% since program's phase and
> number of iterations could change after method is compiled. See
> policy_unroll() and policy_maximally_unroll():
>
> http://hg.openjdk.java.net/**hsx/hotspot-comp/hotspot/file/**
> 89d0a5d40008/src/share/vm/**opto/loopTransform.cpp<http://hg.openjdk.java.net/hsx/hotspot-comp/hotspot/file/89d0a5d40008/src/share/vm/opto/loopTransform.cpp>
>
> We could use deopt as you suggested but deoptimization is double-edge
> sword, when method recompiled after deoptimization some aggressive
> optimizations will not be executed for it so the new generated code could
> be slower.
>
> Regards,
> Vladimir
>
>
> On 1/16/12 2:37 PM, Vitaly Davidovich wrote:
>
>> Hi Vladimir,
>>
>> If x_col is always seen to be same value in the profile shouldn't the
>> loop be unrolled as well with some deopt guard? Or
>> does this not participate in profiling?
>>
>> Thanks
>>
>> On Jan 16, 2012 4:57 PM, "Vladimir Kozlov" <vladimir.kozlov at oracle.com<mailto:
>> vladimir.kozlov@**oracle.com <vladimir.kozlov at oracle.com>>> wrote:
>>
>> > be different, because the expressions are similar. The difference
>> in runtime (due to cols being a compile-time
>> > constant) will be visible elsewhere. Is that right? If so, where
>> would I be able to detect this?
>>
>> In such situations we usually use some visual tools to see difference
>> between log outputs. At least you can use
>> 'diff'. You may need to replace instructions addresses in outputs
>> (number at the beginning of lines) with the same
>> value to match. There are few tricks you may use to get similar
>> PrintOptoAssembly output. Use next flags to avoid
>> mixing output from program output and from 2 compiler threads (flags
>> stop program until a method is compiled and run
>> only one compiler thread):
>>
>> -Xbatch -XX:CICompilerCount=1
>>
>> Also add -XX:+PrintCompilation -XX:+PrintInlining to see what method
>> is compiled and inlined. Note that you may see
>> similar output for individual methods but could be big difference in
>> compiled caller (computeAll()) method where 2
>> loop methods could be inlined. So you need to compare all compiled
>> methods.
>>
>> In general, to have constant as loop limit is always win because some
>> checks in generated code could be avoided and
>> more optimizations could be done for such loops. Use
>> -XX:+TraceLoopOpts to see what loop optimizations are done in
>> both cases.
>>
>> For example, in your code you set 'x_col = 3', as result the next loop
>> in constructNearestClusterVector(**__) will be
>>
>> fully unrolled when this method is inlined into computeAll() and x_col
>> is replaced with '3':
>>
>> for(k = 0; k < x_col; k++) {
>> double tmp = x[i*x_col + k] - mu[j* mu_col + k];
>> dist += tmp * tmp;
>> }
>>
>> Vladimir
>>
>> On 1/16/12 1:39 AM, Manohar Jonnalagedda wrote:
>>
>> Hi Kris, Vladimir,
>>
>> thanks for both your responses.
>>
>> Second, your two test methods are different so you can't
>> directly compare them. method1() iterates over rows
>> using
>> middle loop index 'j' and method2() uses external loop index
>> 'i'. Unless they are typos again.
>>
>>
>> You are right, these are indeed typos. As Kris suggested, I have
>> the code printed here:
>> http://pastebin.com/xRFD1Nt1.
>> The methods corresponding to method1, and method2 are
>> constructNearestClusterVector and computeNewCentroids. Their
>> PrintOptoAssembly outputs are respectively at
>> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>>
>> Also, it seems I have not explained myself correctly. I am not
>> trying to compare the performance of method1 with
>> respect
>> to that of method2: method1 and method2 both run in the same
>> program. What I am trying to compare is their
>> performance
>> in two cases:
>> - when cols is a compile-time constant (much faster)
>> - when cols is a value determined at run-time
>>
>> If you are using jdk7 there are few flags you can use to print
>> loop optimizations information. They need debug
>> version of VM but it is not problem for you, I think, since
>> you can use debug PrintOptoAssembly flag.
>>
>> -XX:+TraceLoopOpts prints all happened loop optimizations and
>> loop tree after each round of loop opts,
>> -XX:+TraceLoopPredicate prints RC information when it is moved
>> from a loop,
>> -XX:+TraceRangeLimitCheck prints additional information for RC
>> elimination optimization.
>>
>>
>> Thanks for these, I will have a look at what they output.
>>
>> Fourth, range check expression in your example is not what you
>> think. RC expression should be next:
>> (i*stride+offset) where 'i' is loop variable, 'stride' is
>> constant and 'offset' is loop invariant.
>>
>> In your example 'offset' is (j * cols) since it is loop
>> invariant, 'k' is loop variable and stride is '1' (one).
>> In both your methods RC will be moved out of inner loop so the
>> code for it will be the same. The only
>> difference in
>> these methods will be where and how (j * cols) and (i * cols)
>> expressions are calculated.
>>
>>
>> I'd guess it's the difference in locality that made the
>> difference in performance in your two tests.
>>
>> Thanks for the explanation. I understand from the above that the
>> assembly output in both cases mentioned above
>> may not
>> be different, because the expressions are similar. The difference
>> in runtime (due to cols being a compile-time
>> constant)
>> will be visible elsewhere. Is that right? If so, where would I be
>> able to detect this?
>>
>> Cheers,
>> Manohar
>>
>> In your PrintOptoAssembly output snippet, the instruction
>> at 0x13e is a LoadRange, which loads the range
>> from
>> the header
>> of an array:
>>
>> (from x86_64.ad <http://x86_64.ad> <http://x86_64.ad> <
>> http://x86_64.ad>)
>>
>>
>> // Load Range
>> instruct loadRange(rRegI dst, memory mem)
>> %{
>> match(Set dst (LoadRange mem));
>>
>> ins_cost(125); // XXX
>> format %{ "movl $dst, $mem\t# range" %}
>> opcode(0x8B);
>> ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst,
>> mem));
>> ins_pipe(ialu_reg_mem);
>> %}
>>
>> That's not a range check just yet; the real check, if any,
>> should come after the null check, in the form of
>> comparing
>> something else with RSI. But you didn't show what's after
>> the null check, how RSI is used, so it's hard
>> to say what
>> you're seeing in your example.
>>
>> As for the two test examples, could you paste the entire
>> source code, with the PrintOptoAssembly output of
>> method1() and
>> method2() ? The first example looks weird, maybe it's a
>> typo but you're using "j < cols" as the loop
>> condition
>> for the
>> inner loop.
>>
>>
>> - Kris
>>
>> On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda <
>> manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com> <mailto:manojo10386 at gmail.com<mailto:
>> manojo10386 at gmail.com>**>
>> <mailto:manojo10386 at gmail.com <mailto:manojo10386 at gmail.com>
>> <mailto:manojo10386 at gmail.com
>> <mailto:manojo10386 at gmail.com>**>__>> wrote:
>>
>> Hello,
>>
>> following this reference on Range Check Elimination
>> done by the Hotspot compiler [1], I was keen in
>> knowing
>> how I
>> can detect whether range checks are taking place in
>> loops by inspecting output using the
>> PrintAssembly flag;
>> with
>> the old PrintOptoAssembly flag, I have seen output
>> such as the following, which I assume to be range
>> checks :
>>
>> B11: # B73 B12 <- B10 Freq: 1.21365
>> 139 movq RAX, [rsp + #24] # spill
>> 13e movl RSI, [RAX + #12 (8-bit)] # range
>> 141 NullCheck RAX
>>
>> What is the equivalent with the new PrintAssembly flag
>> (using hsdis)?
>>
>> Moreover, as stated on the wiki page [1], loops are
>> optimized if the stride is a compile-time
>> constant. I
>> performed
>> a few tests on a kmeans program, with 3 nested loops,
>> having the following (high-level) structure:
>>
>> ===
>> void method1(){
>> //loop 1
>> for(int i = 0; i< rows1; i++){
>> //...
>> for(int j = 0; j< rows2; j++){
>> //...
>> for(int k = 0; j < cols; k++){ array[j * cols + k]
>> = //...}
>> }
>> }
>> }
>>
>> void method2(){
>> //loop 2
>> for(int i =0; i < rows1; i++){
>> for(int j=0 ; i< rows2; j++){
>> for(int k=0 ; k< cols; k++){
>> array[i*cols+k] = //...
>> }
>> }
>> }
>> }
>>
>> void main(){
>>
>> do{
>> method1(); method2();
>> }while(!converged)
>>
>> }
>> ====
>>
>> In the first test, cols is an int whose value is
>> determined at runtime (by reading a file), in the
>> second
>> test, it
>> is given as a compile-time constant(3). In the second
>> test, there is a */significant*/ speed-up
>> (around 40%).
>>
>> However, when studying the diff of the output of
>> PrintOptoAssembly for both method1 and method2,
>> there is no
>> difference (apart from slight value changes in
>> frequency). Would you have any hints as to where I
>> could look for
>> differences?
>>
>> Thanks a lot,
>> Manohar
>>
>> [1] https://wikis.oracle.com/__**
>> display/HotSpotInternals/__**RangeCheckElimination<https://wikis.oracle.com/__display/HotSpotInternals/__RangeCheckElimination>
>>
>> <https://wikis.oracle.com/**display/HotSpotInternals/**
>> RangeCheckElimination<https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination>
>> >
>>
>>
>>
>> As Kris pointed you need to fix your example:
>>
>>
--
Vitaly
617-548-7007 (mobile)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.openjdk.java.net/pipermail/hotspot-dev/attachments/20120116/cb2bce5a/attachment-0001.html
More information about the hotspot-dev
mailing list