Loop Peeling in Guest Language | Was: Optimization Thresholds?
Tom Rodriguez
tom.rodriguez at oracle.com
Wed Jul 16 18:18:33 UTC 2014
Basically, I’d guess some test is becoming loop invariant that we aren’t otherwise able to prove invariant. Or maybe some type information becomes more precise.
Peeling can have several benefits when dealing with loops. It can take a test inside the loop and create a copy outside the loop, which may allow the copy inside the loop to be eliminated. It also has an effect similar to loop inversion, in that you get a block that is entered if and only if the loop itself is entered, which is handy for several reasons. Loop inversion is something we will be adding in the next month or so.
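As a minimal illustration of the first point (a hand-written sketch; the class, the lazily initialized field, and all names are invented for this example, not taken from Graal or the benchmarks): a test that can only fail on the first iteration becomes removable once that iteration is peeled.

```
// A check that can fail only on the first trip through the loop.
final class PeelingExample {
  private int[] cache; // lazily initialized on first use

  int sumOriginal(int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
      if (cache == null) {              // tested on every iteration
        cache = new int[] {1, 2, 3};
      }
      sum += cache[i % cache.length];
    }
    return sum;
  }

  // With the first iteration peeled, the in-loop copy of the test is
  // dominated by the peeled copy, so the compiler can prove it always
  // succeeds and drop it. The 'n > 0' guard is also the inversion-style
  // block that is entered if and only if the loop itself is entered.
  int sumPeeled(int n) {
    int sum = 0;
    if (n > 0) {
      if (cache == null) {
        cache = new int[] {1, 2, 3};
      }
      sum += cache[0];
      for (int i = 1; i < n; i++) {
        sum += cache[i % cache.length]; // cache != null is now invariant
      }
    }
    return sum;
  }
}
```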
tom
On Jul 16, 2014, at 10:23 AM, Stefan Marr <java at stefan-marr.de> wrote:
> Dear all:
>
> Trying to make sense of why the loop peeling worked, I didn’t find much.
> On the contrary, it got even more confusing.
> The benchmarks in question did not even split the call targets for the two DirectCallNodes, because the loop has so many iterations that compilation was triggered before the call target for the first iteration was split off.
>
> Taking that into account, I experimented with peeling the loop only on the Java level, and it turns out that helps my benchmark as much as the full Truffle splitting did.
> What I mean by that is that I changed from the following implementation of a simple Smalltalk for loop:
>
> ```
> try {
>   for (long i = receiver; i <= limit; i++) {
>     valueSend.call(frame, new Object[] {block, i});
>   }
> } finally {
>   if (CompilerDirectives.inInterpreter() && (limit - receiver) > 0) {
>     reportLoopCount(limit - receiver);
>   }
> }
> return receiver;
> ```
>
> To this version first, doing Truffle-level peeling:
>
> ```
> try {
>   if (receiver <= limit) {
>     firstValueSend.call(frame, new Object[] {block, receiver}); // extra DirectCallNode for the peeled iteration
>
>     for (long i = receiver + 1; i <= limit; i++) { // start after the peeled iteration
>       valueSend.call(frame, new Object[] {block, i});
>     }
>   }
> } // ...
> ```
>
> But since the call target was usually not split, I also tried this:
>
> ```
> if (receiver <= limit) {
>   valueSend.call(frame, new Object[] {block, receiver}); // same DCN as in the loop
>
>   for (long i = receiver + 1; i <= limit; i++) { // start after the peeled iteration
>     valueSend.call(frame, new Object[] {block, i});
>   }
> }
> ```
>
> This still fixes my benchmark. However, it also has the side effect of increasing the warmup time of the benchmarks.
>
> Any ideas what could be causing this? I presume it’s something in Graal’s optimizer.
>
> Thanks
> Stefan
>
>
> On 14 Jul 2014, at 14:35, Stefan Marr <java at stefan-marr.de> wrote:
>
>> Hi:
>>
>> Turns out, my issue from a month ago wasn’t actually related to the reflective aspects at all, and the specializations I put in place to optimize them worked perfectly.
>>
>> Instead, the problem turned out to be one that can be solved with loop peeling.
>>
>> It seems that, for some reason not entirely clear to me, the first iteration of the loop is different, and that seems to be somehow related to reading an object field.
>> Either way, the microbenchmark benefits from an explicitly peeled first loop iteration implemented in the specialization node. That means I use two DirectCallNodes for the loop body: one for the first iteration, and a second one for all remaining iterations [1].
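>>
>> As a minimal sketch of that shape (the class, field, and method names here are invented for illustration; see [1] for the actual TruffleSOM code):
>>
>> ```
>> import com.oracle.truffle.api.CallTarget;
>> import com.oracle.truffle.api.Truffle;
>> import com.oracle.truffle.api.frame.VirtualFrame;
>> import com.oracle.truffle.api.nodes.DirectCallNode;
>> import com.oracle.truffle.api.nodes.Node;
>>
>> // Hypothetical node following the two-DirectCallNode pattern described above.
>> final class PeeledToDoNode extends Node {
>>   @Child private DirectCallNode firstIterSend; // first iteration only
>>   @Child private DirectCallNode restIterSend;  // all remaining iterations
>>
>>   PeeledToDoNode(CallTarget blockTarget) {
>>     // Two call nodes give the compiler two distinct call sites, so the
>>     // peeled first iteration can be specialized independently.
>>     firstIterSend = Truffle.getRuntime().createDirectCallNode(blockTarget);
>>     restIterSend  = Truffle.getRuntime().createDirectCallNode(blockTarget);
>>   }
>>
>>   public long executeToDo(VirtualFrame frame, long receiver, long limit, Object block) {
>>     if (receiver <= limit) {
>>       firstIterSend.call(frame, new Object[] {block, receiver}); // peeled iteration
>>       for (long i = receiver + 1; i <= limit; i++) {
>>         restIterSend.call(frame, new Object[] {block, i});
>>       }
>>     }
>>     return receiver;
>>   }
>> }
>> ```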
>>
>> Unfortunately, this form of loop peeling is not generally a win.
>> Kalibera et al. report that it is useful for loops in R [2].
>> However, on my benchmark collection, I see that it reduces performance on average.
>>
>> So, I was wondering what your experience with loop peeling on the level of the guest language is.
>> Are there perhaps some heuristics that would help to decide whether to use it or not?
>>
>> Thanks
>> Stefan
>>
>>
>> [1] https://github.com/SOM-st/TruffleSOM/blob/master/src/som/interpreter/nodes/specialized/IntToDoMessageNode.java#L58
>> [2] http://dl.acm.org/citation.cfm?doid=2576195.2576205
>>
>>
>> On 09 Jun 2014, at 22:06, Stefan Marr <java at stefan-marr.de> wrote:
>>
>>> Hi:
>>>
>>> I am running into a strange issue when optimizing some reflective operations.
>>> I don’t think it is related to the operations themselves, but rather to the way the optimizations/Graal work.
>>>
>>> When benchmarked separately, I see, as desired, the same performance for direct and reflective operations.
>>> So, I assume that my specializations are sufficient to please the optimizer.
>>> Concretely, that is reflective method invocation via Smalltalk’s #perform: as well as intercepting undefined methods with #doesNotUnderstand:.
>>>
>>> However, if both reflective mechanisms are used in combination to implement something like dynamic proxies, runtime nearly doubles compared to what I would expect.
>>>
>>> I have been experimenting with the various thresholds in TruffleCompilerOptions, but without any luck.
>>> Since the benchmarks are still microbenchmarks, I don’t think I am really hitting any of those anyway.
>>> The fully inlined tree has something like 220 nodes. That’s the number I see in the error output after setting TruffleGraphMaxNodes to a very small number. And that’s just about 20 more than what is reported for the non-reflective, i.e., direct, benchmark.
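>>>
>>> For example, the node count can be provoked like this (assuming the -G: option syntax of Graal builds from that time; the exact spelling may differ):
>>>
>>> ./graal.sh -G:TruffleGraphMaxNodes=10 -cp Smalltalk:Examples/Benchmarks/DoesNotUnderstand Examples/Benchmarks/BenchmarkHarness.som ProxyAdd 20 0 1000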
>>>
>>> What would be a good approach to figure out what’s going wrong here?
>>>
>>> Thanks
>>> Stefan
>>>
>>> To reproduce:
>>>
>>> git clone --recursive https://github.com/SOM-st/TruffleSOM.git
>>> ant jar
>>> ant test
>>> ./graal.sh -cp Smalltalk:Examples/Benchmarks/DoesNotUnderstand Examples/Benchmarks/BenchmarkHarness.som DirectAdd 20 0 1000
>>> ./graal.sh -cp Smalltalk:Examples/Benchmarks/DoesNotUnderstand Examples/Benchmarks/BenchmarkHarness.som ProxyAdd 20 0 1000
>>>
>>> --
>>> Stefan Marr
>>> INRIA Lille - Nord Europe
>>> http://stefan-marr.de/research/
>>>
>>>
>>>
>>
>> --
>> Stefan Marr
>> INRIA Lille - Nord Europe
>> http://stefan-marr.de/research/
>>
>>
>>
>
> --
> Stefan Marr
> INRIA Lille - Nord Europe
> http://stefan-marr.de/research/
>
>
>