Loop Peeling in Guest Language | Was: Optimization Thresholds?

Stefan Marr java at stefan-marr.de
Wed Jul 16 17:23:48 UTC 2014


Dear all:

Trying to make sense of why the loop peeling worked, I didn’t find much.
On the contrary, it got even more confusing.
The benchmarks in question did not even split the call targets for the two DirectCallNodes: the loop has so many iterations that compilation was triggered before the first iteration was split off.

Taking that into account, I experimented with peeling the loop only on the Java level, and it turns out that helps my benchmark as much as full Truffle splitting did.
What I mean by that is that I changed the following implementation of a simple Smalltalk for loop:

```
try {
  for (long i = receiver; i <= limit; i++) {
    valueSend.call(frame, new Object[] {block, i});
  }
} finally {
  if (CompilerDirectives.inInterpreter() && (limit - receiver) > 0) {
     reportLoopCount(limit - receiver);
  }
}
return receiver;
```

To this version first, doing Truffle level peeling:

```
try {
  if (receiver <= limit) {
    firstValueSend.call(frame, new Object[] {block, receiver});  // extra DirectCallNode for the peeled iteration

    for (long i = receiver + 1; i <= limit; i++) {  // remaining iterations
      valueSend.call(frame, new Object[] {block, i});
    }
  }
} // ...
```

But since the call target was usually not split, I also tried this:

```
  if (receiver <= limit) {
    valueSend.call(frame, new Object[] {block, receiver});  // same DirectCallNode as in the loop

    for (long i = receiver + 1; i <= limit; i++) {  // remaining iterations
      valueSend.call(frame, new Object[] {block, i});
    }
  }
```
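For reference, the transformation itself can be sketched in plain Java, independent of Truffle. The `Block` interface and the method names below are made up for illustration; they merely stand in for the call through the DirectCallNode:

```java
// Minimal sketch of loop peeling, without any Truffle dependencies.
// Block stands in for the Smalltalk block invoked via the DirectCallNode;
// all names here are illustrative, not from the actual TruffleSOM code.
interface Block {
    void value(long i);
}

public class PeelSketch {
    // Straightforward to:do: loop, as in the first snippet.
    static void runStraight(long receiver, long limit, Block block) {
        for (long i = receiver; i <= limit; i++) {
            block.value(i);
        }
    }

    // Peeled variant: the first iteration is hoisted out of the loop,
    // and the loop continues from receiver + 1, so each index is still
    // executed exactly once.
    static void runPeeled(long receiver, long limit, Block block) {
        if (receiver <= limit) {
            block.value(receiver);  // peeled first iteration
            for (long i = receiver + 1; i <= limit; i++) {
                block.value(i);
            }
        }
    }

    public static void main(String[] args) {
        long[] a = new long[1];
        long[] b = new long[1];
        runStraight(1, 10, i -> a[0] += i);
        runPeeled(1, 10, i -> b[0] += i);
        // Both variants must do the same work: 1 + 2 + ... + 10 = 55.
        System.out.println(a[0] + " " + b[0]);
    }
}
```

Both variants compute identical results; the only difference the optimizer sees is the extra call site outside the loop.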

This still fixes my benchmark. However, it also has the side effect of increasing the warmup time of the benchmarks.

Any ideas what could cause this? I presume it’s something in Graal’s optimizer.

Thanks
Stefan


On 14 Jul 2014, at 14:35, Stefan Marr <java at stefan-marr.de> wrote:

> Hi:
> 
> Turns out, my issue from a month ago wasn’t actually related to the reflective aspects at all, and the specializations I put in place to optimize them worked perfectly.
> 
> Instead, the problem turned out to be one that can be solved with loop peeling.
> 
> It seems, for some reason that is not entirely clear to me, that the first iteration of the loop is different, which appears to be somehow related to reading an object field.
> Either way, the micro benchmark benefits from an explicitly peeled first loop iteration implemented in the specialization node. That means, I use two DirectCallNodes for the loop body, one for the first iteration, and the second one for the other iterations [1].
> 
> Unfortunately, this form of loop peeling is not generally a win.
> Kalibera et al. report that it is useful in R for loops [2].
> However, on my benchmark collection, I see that it reduces performance on average.
> 
> So, I was wondering what your experience with loop peeling on the level of the guest language is.
> Are there perhaps some heuristics that would help to decide whether to use it or not?
> 
> Thanks
> Stefan
> 
> 
> [1] https://github.com/SOM-st/TruffleSOM/blob/master/src/som/interpreter/nodes/specialized/IntToDoMessageNode.java#L58
> [2] http://dl.acm.org/citation.cfm?doid=2576195.2576205
> 
> 
> On 09 Jun 2014, at 22:06, Stefan Marr <java at stefan-marr.de> wrote:
> 
>> Hi:
>> 
>> I am running into a strange issue when optimizing some reflective operations.
>> I don’t think it is related to the operations themselves, but rather to the way the optimizations/Graal work.
>> 
>> If benchmarked separately, I see, as desired, the same performance results for direct and reflective operations.
>> So, I assume that my specializations are sufficient to please the optimizer.
>> Concretely, that is reflective method invocation via Smalltalk’s #perform: as well as intercepting undefined methods with #doesNotUnderstand:.
>> 
>> However, if both reflective mechanisms are used in combination to implement something like dynamic proxies, runtime nearly doubles compared to what I would expect.
>> 
>> I have been experimenting with the various thresholds in TruffleCompilerOptions, but without any luck.
>> Since the benchmarks are still microbenchmarks, I don’t think I am really hitting any of those anyway.
>> The fully inlined tree has something like 220 nodes. That’s the number I see in the error output after setting TruffleGraphMaxNodes to a very small number, and it’s just about 20 more than what I get reported for the non-reflective, i.e., direct benchmark.
>> 
>> What would be a good approach to figure out what’s going wrong here?
>> 
>> Thanks
>> Stefan
>> 
>> To reproduce:
>> 
>> git clone --recursive https://github.com/SOM-st/TruffleSOM.git
>> ant jar
>> ant test
>> ./graal.sh -cp Smalltalk:Examples/Benchmarks/DoesNotUnderstand Examples/Benchmarks/BenchmarkHarness.som DirectAdd 20 0 1000
>> ./graal.sh -cp Smalltalk:Examples/Benchmarks/DoesNotUnderstand Examples/Benchmarks/BenchmarkHarness.som ProxyAdd 20 0 1000
>> 
> 

-- 
Stefan Marr
INRIA Lille - Nord Europe
http://stefan-marr.de/research/




