Detecting range check elimination with PrintAssembly

Manohar Jonnalagedda manojo10386 at gmail.com
Thu Jan 19 03:22:08 PST 2012


Hi Vitaly,

> Are you repeatedly seeing ~40% speedup with a compile-time constant?
>

Yes, I am indeed. The running time drops from around 7.8 s to around 5.3 s
when constants are used.
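For reference, a minimal sketch of the two cases being timed (hypothetical names; the real code is the kmeans program discussed below, not this sketch). The only difference is whether cols is a `static final` compile-time constant the JIT can fold into the index arithmetic, or a plain run-time value:

```java
public class ColsBench {
    static final int COLS = 3; // compile-time constant: the JIT can fold this

    public static long fillConst(int[] array, int rows) {
        long sum = 0;
        for (int j = 0; j < rows; j++)
            for (int k = 0; k < COLS; k++) {
                array[j * COLS + k] = j + k; // index uses the folded constant
                sum += j + k;
            }
        return sum;
    }

    public static long fillRuntime(int[] array, int rows, int cols) {
        long sum = 0;
        for (int j = 0; j < rows; j++)
            for (int k = 0; k < cols; k++) {
                array[j * cols + k] = j + k; // index uses a run-time value
                sum += j + k;
            }
        return sum;
    }

    public static void main(String[] args) {
        int rows = 262_144;
        int[] a = new int[rows * COLS];
        int[] b = new int[rows * COLS];
        // Same work in both variants; only the "constancy" of cols differs.
        System.out.println(fillConst(a, rows) == fillRuntime(b, rows, 3)); // true
    }
}
```

Both variants compute the same result; any speed difference comes purely from what the compiler can prove about cols.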


> In the two assembly dumps you posted, computeNewCentroids seems to have
> some loop unrolling + non-temporal prefetch instructions that I don't
> see (albeit at a cursory glance) in the second method. How big is the
> input array for these functions? If it's larger than your highest-level
> cache, can you try running the same tests (constant vs. non-constant)
> with a size that would fit into L2/L3?
>

The input array holds 262,144 * 3 ints, which is around 3 MB, so it is
definitely already in the L2/L3 range.
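For reference, the arithmetic behind that estimate (assuming 4-byte ints and ignoring the array object header):

```java
public class Footprint {
    // elements * bytesPerElement, kept in long to avoid int overflow
    public static long bytes(long elements, long bytesPerElement) {
        return elements * bytesPerElement;
    }

    public static void main(String[] args) {
        long elements = 262_144L * 3;           // 786,432 ints
        System.out.println(bytes(elements, 4)); // 3145728 bytes, i.e. 3 MiB
    }
}
```

Per the ark.intel.com page linked below, the X5550 has an 8 MB shared L3, so the array fits in L3, though not in the 256 KB per-core L2.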


> What cpu are you running these tests on?
>

I am running the tests on an Intel Xeon X5550:
http://ark.intel.com/products/37106/Intel-Xeon-Processor-X5550-%288M-Cache-2_66-GHz-6_40-GTs-Intel-QPI%29

Thanks,
Manohar


On Mon, Jan 16, 2012 at 4:39 AM, Manohar Jonnalagedda
<manojo10386 at gmail.com> wrote:

> Hi Kris, Vladimir,
>
> thanks for both your responses.
>
> Second, your two test methods are different so you can't directly compare
>> them. method1() iterates over rows using middle loop index 'j' and
>> method2() uses external loop index 'i'. Unless they are typos again.
>>
>
> You are right, these are indeed typos. As Kris suggested, I have posted the
> code here: http://pastebin.com/xRFD1Nt1. The methods corresponding to
> method1 and method2 are constructNearestClusterVector and
> computeNewCentroids. Their PrintOptoAssembly outputs are at
> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5 respectively.
>
> Also, it seems I have not explained myself correctly. I am not trying to
> compare the performance of method1 with that of method2: method1 and
> method2 both run in the same program. What I am trying to compare is
> their performance in two cases:
> - when cols is a compile-time constant (much faster)
> - when cols is a value determined at run time
>
>
>
>> If you are using jdk7, there are a few flags you can use to print loop
>> optimization information. They need a debug version of the VM, but that
>> is not a problem for you, I think, since you can already use the
>> debug-only PrintOptoAssembly flag.
>>
>> -XX:+TraceLoopOpts prints every loop optimization performed, and the
>> loop tree after each round of loop opts,
>> -XX:+TraceLoopPredicate prints range check (RC) information when a
>> check is moved out of a loop,
>> -XX:+TraceRangeLimitCheck prints additional information for the RC
>> elimination optimization.
>>
>
> Thanks for these, I will have a look at what they output.
>
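Putting those flags together, an invocation might look like the following (a debug or fastdebug jdk7 build is assumed, and the main class name here is hypothetical):

```shell
# Requires a debug/fastdebug HotSpot build; flag set from the reply above.
java -XX:+TraceLoopOpts \
     -XX:+TraceLoopPredicate \
     -XX:+TraceRangeLimitCheck \
     -XX:+PrintOptoAssembly \
     KMeans        # hypothetical main class
```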
>> Fourth, the range check expression in your example is not what you
>> think. The RC expression has the form:
>> (i*stride + offset), where 'i' is the loop variable, 'stride' is a
>> constant, and 'offset' is loop invariant.
>>
>> In your example, 'offset' is (j * cols) since it is loop invariant, 'k'
>> is the loop variable, and the stride is 1 (one).
>> In both of your methods the RC will be moved out of the inner loop, so
>> the code for it will be the same. The only difference between the
>> methods will be where and how the (j * cols) and (i * cols) expressions
>> are calculated.
>>
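The hoisting described above can be mimicked by hand in plain Java. The sketch below (hypothetical, not the actual generated code) contrasts the conceptual per-element bounds check with a single guard per j-iteration, which is roughly what moving the RC out of the inner loop achieves:

```java
public class RcHoist {
    // Naive version: conceptually, every access array[off + k] carries
    // its own bounds check.
    public static long naive(int[] array, int rows, int cols) {
        long sum = 0;
        for (int j = 0; j < rows; j++) {
            int off = j * cols;               // loop-invariant 'offset'
            for (int k = 0; k < cols; k++) {  // 'k' is the loop variable, stride 1
                array[off + k] = j + k;
                sum += j + k;
            }
        }
        return sum;
    }

    // Hoisted version: one explicit range test per j-iteration guards the
    // whole inner loop, so the per-element checks become redundant.
    public static long hoisted(int[] array, int rows, int cols) {
        long sum = 0;
        for (int j = 0; j < rows; j++) {
            int off = j * cols;
            if (off < 0 || off + cols > array.length)
                throw new ArrayIndexOutOfBoundsException(off);
            for (int k = 0; k < cols; k++) {
                array[off + k] = j + k;
                sum += j + k;
            }
        }
        return sum;
    }
}
```

Both methods produce identical results; the point is only where the range test sits.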
>
>> I'd guess it's the difference in locality that made the difference in
>> performance in your two tests.
>>
>>
> Thanks for the explanation. I understand from the above that the assembly
> output in the two cases may not differ, because the expressions are
> similar. The difference in run time (due to cols being a compile-time
> constant) will show up elsewhere. Is that right? If so, where would I be
> able to detect it?
>
> Cheers,
> Manohar
>
>
>>> In your PrintOptoAssembly output snippet, the instruction at 0x13e is a
>>> LoadRange, which loads the range from the header of an array:
>>>
>>> (from x86_64.ad)
>>>
>>> // Load Range
>>> instruct loadRange(rRegI dst, memory mem)
>>> %{
>>>   match(Set dst (LoadRange mem));
>>>
>>>   ins_cost(125); // XXX
>>>   format %{ "movl    $dst, $mem\t# range" %}
>>>   opcode(0x8B);
>>>   ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst, mem));
>>>   ins_pipe(ialu_reg_mem);
>>> %}
>>>
>>> That's not a range check just yet; the real check, if any, should come
>>> after the null check, in the form of a comparison of something else
>>> with RSI. But you didn't show what comes after the null check or how
>>> RSI is used, so it's hard to say what you're seeing in your example.
>>>
>>> As for the two test examples, could you paste the entire source code,
>>> along with the PrintOptoAssembly output of method1() and method2()?
>>> The first example looks weird; maybe it's a typo, but you're using
>>> "j < cols" as the loop condition for the inner loop.
>>>
>>>
>>
>>> - Kris
>>>
>>> On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda
>>> <manojo10386 at gmail.com> wrote:
>>>
>>>    Hello,
>>>
>>>    following this reference on range check elimination in the HotSpot
>>> compiler [1], I was keen to know how I can detect whether range checks
>>> are taking place in loops by inspecting the output produced with the
>>> PrintAssembly flag; with the old PrintOptoAssembly flag, I have seen
>>> output such as the following, which I assume to be range checks:
>>>
>>>    B11: #  B73 B12 <- B10  Freq: 1.21365
>>>    139     movq    RAX, [rsp + #24]  # spill
>>>    13e     movl    RSI, [RAX + #12 (8-bit)]  # range
>>>    141     NullCheck RAX
>>>
>>>    What is the equivalent with the new PrintAssembly flag (using hsdis)?
>>>
>>>    Moreover, as stated on the wiki page [1], loops are optimized if
>>> the stride is a compile-time constant. I performed a few tests on a
>>> kmeans program with 3 nested loops, having the following (high-level)
>>> structure:
>>>
>>>    ===
>>>    void method1() {
>>>      // loop 1
>>>      for (int i = 0; i < rows1; i++) {
>>>        // ...
>>>        for (int j = 0; j < rows2; j++) {
>>>          // ...
>>>          for (int k = 0; j < cols; k++) { array[j * cols + k] = // ...
>>>          }
>>>        }
>>>      }
>>>    }
>>>
>>>    void method2() {
>>>      // loop 2
>>>      for (int i = 0; i < rows1; i++) {
>>>        for (int j = 0; i < rows2; j++) {
>>>          for (int k = 0; k < cols; k++) {
>>>            array[i * cols + k] = // ...
>>>          }
>>>        }
>>>      }
>>>    }
>>>
>>>    void main() {
>>>      do {
>>>        method1(); method2();
>>>      } while (!converged);
>>>    }
>>>    ===
>>>
>>>    In the first test, cols is an int whose value is determined at run
>>> time (by reading a file); in the second test, it is given as the
>>> compile-time constant 3. In the second test, there is a *significant*
>>> speed-up (around 40%).
>>>
>>>    However, when studying the diff of the PrintOptoAssembly output for
>>> method1 and method2, there is no difference (apart from slight changes
>>> in the frequency values). Would you have any hints as to where I could
>>> look for the differences?
>>>
>>>    Thanks a lot,
>>>    Manohar
>>>
>>>    [1]
>>> https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
>>>
>>>
>>>
>> As Kris pointed out, you need to fix your example:


> --
> Vitaly
> 617-548-7007 (mobile)


More information about the hotspot-dev mailing list