A loop optimization sometimes results in pessimization

Vladimir Kozlov vladimir.kozlov at oracle.com
Tue Feb 18 12:05:28 PST 2014


Yes, it is know problem:

https://bugs.openjdk.java.net/browse/JDK-7101232

"LCM place all loads at the beginning of the loop and then use loaded 
values. And loaded values are spilled on stack since there are no 
registers enough"

I hope we will have time to look this soon.

Thanks,
Vladimir

On 2/18/14 3:29 AM, Marko Topolnik wrote:
> I have found that the following code:
>
>      for (int i = 0; i < array.length; i++) {
>        final int entry = array[i];
>        result ^= i + array[i];
>        if (entry == 0) break;
>      }
>
> executes slower if I comment out the last line
>
>        if (entry == 0) break;
>
> I have ensured that the array has no zero values.
>
> After some analysis, I have isolated the cause to be an additional transformation which is performed in the latter case, when the loop has no premature exit points (subject to the elimination of array bounds check, which does happen in my case). The transformation collates the operations in the unrolled loop, so that array access for all unrolled steps is performed first, followed by all the XORing into the accumulator.
>
> The slowdown occurs because there is a lot of temporary storage needed to remember all the results of memory reads until they are used for XOR-ing. Some of the reads overflow onto the stack.
>
> In detail, I have gathered the following table which correlates execution times with the -XX:LoopUnrollLimit command-line argument:
>
> LoopUnrollLimit:   14 15 18 19 22 23 60
> withExitPoint:     96 95 95 79 80 80 69   1/100 ns
> withoutExitPoint:  94 64 64 63 64 77 75   1/100 ns
>
> The following sudden changes can be observed:
>
> - on the transition from 14 to 15, the withoutExitPoint variant receives a beneficial transformation and is significantly sped up;
>
> - on 18->19, the withExitPoint variant receives a speedup, lesser than the above;
>
> - on 22->23, the withoutExitPoint variant is slowed down. At this point I see the above described collation starting to happen.
>
> The default loopUnrollLimit for my setup is 60, so I present its results in the last column. The withExitPoint variant is faster, even though it has one more line of code.
>
>
> Is this a known effect? Would it be worth fixing?
>
>
> -Marko Topolnik
>
>
> P.S. This is the JMH code I have used for measurements:
>
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @BenchmarkMode(Mode.AverageTime)
> @OperationsPerInvocation(Measure.ARRAY_SIZE)
> @Warmup(iterations = 20, time = 200, timeUnit=MILLISECONDS)
> @Measurement(iterations = 20, time = 200, timeUnit=MILLISECONDS)
> @State(Scope.Thread)
> @Threads(1)
> @Fork(2)
> public class Measure
> {
>    public static final int ARRAY_SIZE = 1024;
>    private final int[] array = new int[ARRAY_SIZE];
>
>    @Setup public void setup() {
>      final Random random = new Random();
>      for (int i = 0; i < ARRAY_SIZE; ++i) {
>        final int x = random.nextInt();
>        array[i] = x == 0? 1 : x;
>      }
>    }
>
>    @GenerateMicroBenchmark public int withoutExitPoint() {
>      final int[] array = this.array;
>      int result = 0;
>      for (int i = 0; i < array.length; i++) {
>        final int entry = array[i];
>        result ^= i + entry;
>      }
>      return result;
>    }
>
>    @GenerateMicroBenchmark public int withExitPoint() {
>      final int[] array = this.array;
>      int result = 0;
>      for (int i = 0; i < array.length; i++) {
>        final int entry = array[i];
>        result ^= i + array[i];
>        if (entry == 0) break;
>      }
>      return result;
>    }
> }
>


More information about the hotspot-compiler-dev mailing list