Detecting range check elimination with PrintAssembly
Manohar Jonnalagedda
manojo10386 at gmail.com
Thu Jan 19 03:22:08 PST 2012
Hi Vitaly,
> Are you repeatedly seeing ~40% speedup with a compile-time constant?
>
Yes, I am indeed. The running time drops from around 7.8 s to around 5.3 s
when constants are used.
> In the two assembly dumps you posted, the computeNewCentroids seems to
> have some loop unrolling + non-temporal prefetch instructions that I don't
> see (at a cursory glance, albeit) in the 2nd method. How big is the input
> array for these functions? If it's larger than your highest level cache,
> can you try running the same tests (constant vs non-constant) with a size
> that would fit into L2/L3?
>
The input array is of size 262,144 * 3 ints, which is around 3 MB, so it is
definitely already in the L2/L3 range.
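(For the record: 262,144 * 3 ints * 4 bytes/int = 3,145,728 bytes, i.e. roughly
3 MB, which is below the 8 MB cache figure on the Intel page linked below.)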
> What cpu are you running these tests on?
>
I am running the tests on an Intel Xeon X5550:
http://ark.intel.com/products/37106/Intel-Xeon-Processor-X5550-%288M-Cache-2_66-GHz-6_40-GTs-Intel-QPI%29
Thanks,
Manohar
On Mon, Jan 16, 2012 at 4:39 AM, Manohar Jonnalagedda <manojo10386 at gmail.com> wrote:
> Hi Kris, Vladimir,
>
> thanks for both your responses.
>
>> Second, your two test methods are different so you can't directly compare
>> them. method1() iterates over rows using middle loop index 'j' and
>> method2() uses external loop index 'i'. Unless they are typos again.
>>
>
> You are right, these are indeed typos. As Kris suggested, I have the code
> printed here: http://pastebin.com/xRFD1Nt1. The methods corresponding to
> method1 and method2 are constructNearestClusterVector and
> computeNewCentroids. Their PrintOptoAssembly outputs are respectively at
> http://pastebin.com/1evN8b3K and http://pastebin.com/FxkVWTD5
>
> Also, it seems I have not explained myself correctly. I am not trying to
> compare the performance of method1 with respect to that of method2: method1
> and method2 both run in the same program. What I am trying to compare is
> their performance in two cases:
> - when cols is a compile-time constant (much faster)
> - when cols is a value determined at run-time
>
>
>
>> If you are using jdk7 there are a few flags you can use to print loop
>> optimization information. They need a debug version of the VM, but that is
>> not a problem for you, I think, since you can already use the debug-only
>> PrintOptoAssembly flag.
>>
>> -XX:+TraceLoopOpts prints all loop optimizations that happened, and the
>> loop tree after each round of loop opts,
>> -XX:+TraceLoopPredicate prints RC information when a check is moved out of
>> a loop,
>> -XX:+TraceRangeLimitCheck prints additional information for the RC
>> elimination optimization.
>>
>
> Thanks for these, I will have a look at what they output.
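> For reference, since these are debug-build flags, I assume an invocation on
> a debug VM would look roughly like the line below (KMeans standing in for my
> benchmark class):
>
> java -XX:+PrintOptoAssembly -XX:+TraceLoopOpts -XX:+TraceLoopPredicate \
>     -XX:+TraceRangeLimitCheck KMeans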
>
>> Fourth, the range check expression in your example is not what you think.
>> The RC expression should be of the form:
>> (i*stride + offset), where 'i' is the loop variable, 'stride' is a
>> constant and 'offset' is loop invariant.
>>
>> In your example 'offset' is (j * cols) since it is loop invariant, 'k' is
>> the loop variable and the stride is '1' (one).
>> In both your methods the RC will be moved out of the inner loop, so the
>> code for it will be the same. The only difference between these methods
>> will be where and how the (j * cols) and (i * cols) expressions are
>> calculated.
>>
>
>>> I'd guess it's the difference in locality that made the difference in
>>> performance in your two tests.
>>>
>>
> Thanks for the explanation. I understand from the above that the assembly
> output may not differ in the two cases, because the expressions are similar
> in form. The difference in runtime (due to cols being a compile-time
> constant) would then be visible elsewhere. Is that right? If so, where would
> I be able to detect it?
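> To make the question concrete, my mental picture of the transformation you
> describe is roughly the sketch below -- conceptual only, not what the
> compiler literally emits:
>
> // as written: every access is implicitly range-checked
> for (int k = 0; k < cols; k++) {
>     array[j * cols + k] = 0;        // check: 0 <= j*cols + k < array.length
> }
>
> // after the RC is moved out of the inner loop (conceptually):
> int offset = j * cols;              // loop invariant
> if (offset < 0 || offset + cols > array.length)
>     throw new ArrayIndexOutOfBoundsException();   // or deoptimize
> for (int k = 0; k < cols; k++) {
>     array[offset + k] = 0;          // no per-iteration check needed
> }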
>
> Cheers,
> Manohar
>
>
>>> In your PrintOptoAssembly output snippet, the instruction at 0x13e is a
>>> LoadRange, which loads the range from the header
>>> of an array:
>>>
>>> (from x86_64.ad)
>>>
>>> // Load Range
>>> instruct loadRange(rRegI dst, memory mem)
>>> %{
>>> match(Set dst (LoadRange mem));
>>>
>>> ins_cost(125); // XXX
>>> format %{ "movl $dst, $mem\t# range" %}
>>> opcode(0x8B);
>>> ins_encode(REX_reg_mem(dst, mem), OpcP, reg_mem(dst, mem));
>>> ins_pipe(ialu_reg_mem);
>>> %}
>>>
>>> That's not a range check just yet; the real check, if any, should come
>>> after the null check, in the form of comparing
>>> something else with RSI. But you didn't show what's after the null
>>> check, how RSI is used, so it's hard to say what
>>> you're seeing in your example.
>>>
>>> As for the two test examples, could you paste the entire source code,
>>> with the PrintOptoAssembly output of method1() and
>>> method2()? The first example looks weird; maybe it's a typo, but you're
>>> using "j < cols" as the loop condition for the
>>> inner loop.
>>>
>>>
>>
>>> - Kris
>>>
>>> On Mon, Jan 16, 2012 at 1:59 AM, Manohar Jonnalagedda <
>>> manojo10386 at gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> Following this reference on Range Check Elimination done by the
>>> HotSpot compiler [1], I was keen to know how I can detect whether
>>> range checks are taking place in loops by inspecting output produced
>>> with the PrintAssembly flag; with the old PrintOptoAssembly flag, I
>>> have seen output such as the following, which I assume to be range
>>> checks:
>>>
>>> B11: # B73 B12 <- B10 Freq: 1.21365
>>> 139 movq RAX, [rsp + #24] # spill
>>> 13e movl RSI, [RAX + #12 (8-bit)] # range
>>> 141 NullCheck RAX
>>>
>>> What is the equivalent with the new PrintAssembly flag (using hsdis)?
>>>
>>> Moreover, as stated on the wiki page [1], loops are optimized if the
>>> stride is a compile-time constant. I performed
>>> a few tests on a kmeans program, with 3 nested loops, having the
>>> following (high-level) structure:
>>>
>>> ===
>>> void method1(){
>>> //loop 1
>>> for(int i = 0; i< rows1; i++){
>>> //...
>>> for(int j = 0; j< rows2; j++){
>>> //...
>>> for(int k = 0; j < cols; k++){ array[j * cols + k] = //...}
>>> }
>>> }
>>> }
>>>
>>> void method2(){
>>> //loop 2
>>> for(int i =0; i < rows1; i++){
>>> for(int j=0 ; i< rows2; j++){
>>> for(int k=0 ; k< cols; k++){
>>> array[i*cols+k] = //...
>>> }
>>> }
>>> }
>>> }
>>>
>>> void main(){
>>>
>>> do{
>>> method1(); method2();
>>> }while(!converged)
>>>
>>> }
>>> ====
>>>
>>> In the first test, cols is an int whose value is determined at runtime
>>> (by reading a file); in the second test, it is given as a compile-time
>>> constant (3). In the second case, there is a significant speed-up
>>> (around 40%).
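>>> Concretely -- and this is only a sketch, the real code reads the value
>>> from the data file -- the two variants of cols are along these lines:
>>>
>>>     int cols = readColsFromFile(path);   // test 1: value known only at runtime
>>>     static final int COLS = 3;           // test 2: compile-time constant
>>>
>>> where readColsFromFile and COLS are just illustrative names.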
>>>
>>> However, when studying the diff of the output of PrintOptoAssembly
>>> for both method1 and method2, there is no
>>> difference (apart from slight value changes in frequency). Would you
>>> have any hints as to where I could look for
>>> differences?
>>>
>>> Thanks a lot,
>>> Manohar
>>>
>>> [1]
>>> https://wikis.oracle.com/display/HotSpotInternals/RangeCheckElimination
>>>
>>>
>>>
>> As Kris pointed out, you need to fix your example:
>
>
> --
> Vitaly
> 617-548-7007 (mobile)
>