scalar replacement of arrays affected by minor changes to surrounding code
dean.long at oracle.com
Tue Sep 17 03:39:44 UTC 2019
The problem sounds similar to this issue:
https://bugs.openjdk.java.net/browse/JDK-6853701
dl
On 9/16/19 3:07 PM, Govind Jajoo wrote:
> hi Eric,
>
> We're operating well within the default limit of
> -XX:EliminateAllocationArraySizeLimit, and as shown in the tests, escape
> analysis is able to identify and elide the array allocations for
> hand-unrolled loops. What we're trying to figure out is why a loop or an
> object wrapper affects this optimization.
> We've tried both with the ... args and without them, creating a temporary
> array instead, and it makes no difference (examples are checked in to the
> GitHub repo).
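>
> For reference, the non-varargs variant we tried looks roughly like this
> (a sketch from memory, with a made-up method name; the exact benchmark is
> in the repo):
>
>     @Benchmark
>     public void tempArraySum(Blackhole bh) {
>         // build the int[] explicitly instead of letting javac do it for the ... args
>         int[] tmp = new int[] { next(), next() };
>         bh.consume(loop(tmp));
>     }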
>
> Are you suggesting that this optimization is not supported in the presence
> of loops?
>
> Thanks,
> Govind
>
>
> On Mon, Sep 16, 2019 at 11:40 PM Eric Caspole <eric.caspole at oracle.com>
> wrote:
>
>> Hi Govind,
>> When you use ... to pass the parameters and receive them as an array, the
>> array must be created at the call site to pass the parameters, so some
>> allocation and GC activity is expected. You can see it in the bytecode for
>> your loopSum:
>>
>>   public void loopSum(org.openjdk.jmh.infra.Blackhole);
>>     descriptor: (Lorg/openjdk/jmh/infra/Blackhole;)V
>>     Code:
>>        0: aload_1
>>        1: iconst_2
>>        2: newarray       int
>>        4: dup
>>        5: iconst_0
>>        6: invokestatic  #6   // Method next:()I
>>        9: iastore
>>       10: dup
>>       11: iconst_1
>>       12: invokestatic  #6   // Method next:()I
>>       15: iastore
>>       16: invokestatic  #2   // Method loop:([I)I
>>       19: invokevirtual #7   // Method org/openjdk/jmh/infra/Blackhole.consume:(I)V
>>       22: return
>>
>> If you want to reduce the object allocation, maybe you can tweak your
>> code to not pass arguments by ...
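>>
>> A fixed-arity variant would avoid the array entirely -- something along
>> these lines (just a sketch against your posted code; loop2 is a made-up
>> name):
>>
>>     private static int loop2(int a, int b) { return a + b; }
>>
>>     @Benchmark
>>     public void fixedAritySum(Blackhole bh) {
>>         bh.consume(loop2(next(), next()));  // no int[] is created at the call site
>>     }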
>> Regards,
>> Eric
>>
>>
>> On 9/16/19 11:19, Govind Jajoo wrote:
>>> Hi team,
>>>
>>> We're seeing some unexpected behaviour with scalar replacement of arrays
>>> getting affected by subtle changes to surrounding code. If a newly created
>>> array is accessed in a loop or wrapped inside another object, the
>>> optimization gets disabled easily. For example, when we run the following
>>> benchmark in JMH (JDK 11 / Linux):
>>>
>>> public class ArrayLoop {
>>>     private static Random s_r = new Random();
>>>     private static int next() { return s_r.nextInt() % 1000; }
>>>
>>>     private static int loop(int... arr) {
>>>         int sum = 0;
>>>         for (int i = arr.length - 1; i >= 0; sum += arr[i--]) { ; }
>>>         return sum;
>>>     }
>>>
>>>     @Benchmark
>>>     public void loopSum(Blackhole bh) {
>>>         bh.consume(loop(next(), next()));
>>>     }
>>> }
>>>
>>> # JMH version: 1.21
>>> # VM version: JDK 11.0.4, OpenJDK 64-Bit Server VM, 11.0.4+11
>>> ArrayLoop.loopSum                 avgt    3   26.124 ±   7.727  ns/op
>>> ArrayLoop.loopSum:·gc.alloc.rate  avgt    3  700.529 ± 208.524  MB/sec
>>> ArrayLoop.loopSum:·gc.count       avgt    3    5.000            counts
>>>
>>> We see unexpected gc activity. When we avoid the loop by "unrolling" it
>>> and adding the following to the ArrayLoop class above
>>>
>>> // silly manually unrolled loop
>>> private static int unrolled(int... arr) {
>>>     int sum = 0;
>>>     switch (arr.length) {
>>>         default: for (int i = arr.length - 1; i >= 4; sum += arr[i--]) { ; }
>>>         case 4:  sum += arr[3];
>>>         case 3:  sum += arr[2];
>>>         case 2:  sum += arr[1];
>>>         case 1:  sum += arr[0];
>>>     }
>>>     return sum;
>>> }
>>>
>>> @Benchmark
>>> public void unrolledSum(Blackhole bh) {
>>>     bh.consume(unrolled(next(), next()));
>>> }
>>>
>>> #
>>> ArrayLoop.unrolledSum                 avgt    3  25.076 ± 1.711  ns/op
>>> ArrayLoop.unrolledSum:·gc.alloc.rate  avgt    3   ≈ 10⁻⁴         MB/sec
>>> ArrayLoop.unrolledSum:·gc.count       avgt    3      ≈ 0         counts
>>>
>>> scalar replacement kicks in as expected. Then, to try out a more realistic
>>> scenario representing our usage, we added the following wrapper and
>>> benchmarks:
>>>
>>> private static class ArrayWrapper {
>>>     final int[] arr;
>>>     ArrayWrapper(int... many) { arr = many; }
>>>     int loopSum() { return loop(arr); }
>>>     int unrolledSum() { return unrolled(arr); }
>>> }
>>>
>>> @Benchmark
>>> public void wrappedUnrolledSum(Blackhole bh) {
>>>     bh.consume(new ArrayWrapper(next(), next()).unrolledSum());
>>> }
>>>
>>> @Benchmark
>>> public void wrappedLoopSum(Blackhole bh) {
>>>     bh.consume(new ArrayWrapper(next(), next()).loopSum());
>>> }
>>>
>>> #
>>> ArrayLoop.wrappedLoopSum                     avgt    3   26.190 ±  18.853  ns/op
>>> ArrayLoop.wrappedLoopSum:·gc.alloc.rate      avgt    3  699.433 ± 512.953  MB/sec
>>> ArrayLoop.wrappedLoopSum:·gc.count           avgt    3    6.000            counts
>>> ArrayLoop.wrappedUnrolledSum                 avgt    3   25.877 ±  13.348  ns/op
>>> ArrayLoop.wrappedUnrolledSum:·gc.alloc.rate  avgt    3  707.440 ± 360.702  MB/sec
>>> ArrayLoop.wrappedUnrolledSum:·gc.count       avgt    3    6.000            counts
>>>
>>> While the LoopSum behaviour is the same as before here, even the UnrolledSum
>>> benchmark starts to show gc activity. What gives?
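>>>
>>> (In case it helps with diagnosis: one way we can dump the JIT's inlining
>>> decisions for a single benchmark is to fork it with diagnostic flags,
>>> roughly like this -- a sketch with a hypothetical benchmark name, and the
>>> exact output is HotSpot-version dependent:)
>>>
>>>     @Benchmark
>>>     @Fork(value = 1, jvmArgsAppend = {
>>>         "-XX:+UnlockDiagnosticVMOptions",
>>>         "-XX:+PrintInlining"   // shows whether ArrayWrapper.<init>/unrolledSum get inlined
>>>     })
>>>     public void wrappedUnrolledSumDiag(Blackhole bh) {
>>>         bh.consume(new ArrayWrapper(next(), next()).unrolledSum());
>>>     }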
>>>
>>> Thanks,
>>> Govind
>>> PS: MCVE available at https://github.com/gjajoo/EA/
>>>