scalar replacement of arrays affected by minor changes to surrounding code
dean.long at oracle.com
Tue Sep 17 03:39:44 UTC 2019
The problem sounds similar to this issue:
https://bugs.openjdk.java.net/browse/JDK-6853701
dl
On 9/16/19 3:07 PM, Govind Jajoo wrote:
> hi Eric,
>
> We're operating well within the default limit of
> -XX:EliminateAllocationArraySizeLimit, and as shown in the tests, escape
> analysis is able to identify and elide the array allocations for
> hand-unrolled loops. What we're trying to figure out is why a loop or an
> object wrapper affects this optimization.
> We've tried both with the ... args and without them, creating a temporary
> array instead, and it makes no difference (examples are checked in to the
> GitHub repo).
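>
> For reference, the non-varargs variant we tried looks roughly like this
> (a sketch from memory, with a made-up method name; the exact benchmark is
> in the repo):
>
>     @Benchmark
>     public void tempArraySum(Blackhole bh) {
>         // build the int[] explicitly instead of letting javac do it for the ... args
>         int[] tmp = new int[] { next(), next() };
>         bh.consume(loop(tmp));
>     }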
>
> Are you suggesting that this optimization is not supported in the presence
> of loops?
>
> Thanks,
> Govind
>
>
> On Mon, Sep 16, 2019 at 11:40 PM Eric Caspole <eric.caspole at oracle.com>
> wrote:
>
>> Hi Govind,
>> When you use ... to pass the parameters and receive them as an array, the
>> array must be created at the call site to pass the parameters, so some
>> allocation and GC activity is expected. You can see it in the bytecode for
>> your loopSum:
>>
>>   public void loopSum(org.openjdk.jmh.infra.Blackhole);
>>     descriptor: (Lorg/openjdk/jmh/infra/Blackhole;)V
>>     Code:
>>        0: aload_1
>>        1: iconst_2
>>        2: newarray       int
>>        4: dup
>>        5: iconst_0
>>        6: invokestatic  #6   // Method next:()I
>>        9: iastore
>>       10: dup
>>       11: iconst_1
>>       12: invokestatic  #6   // Method next:()I
>>       15: iastore
>>       16: invokestatic  #2   // Method loop:([I)I
>>       19: invokevirtual #7   // Method org/openjdk/jmh/infra/Blackhole.consume:(I)V
>>       22: return
>>
>> If you want to reduce the object allocation, maybe you can tweak your
>> code to not pass arguments by ...
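>>
>> A fixed-arity variant would avoid the array entirely -- something along
>> these lines (just a sketch against your posted code; loop2 is a made-up
>> name):
>>
>>     private static int loop2(int a, int b) { return a + b; }
>>
>>     @Benchmark
>>     public void fixedAritySum(Blackhole bh) {
>>         bh.consume(loop2(next(), next()));  // no int[] is created at the call site
>>     }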
>> Regards,
>> Eric
>>
>>
>> On 9/16/19 11:19, Govind Jajoo wrote:
>>> Hi team,
>>>
>>> We're seeing some unexpected behaviour with scalar replacement of arrays
>>> getting affected by subtle changes to surrounding code. If a newly created
>>> array is accessed in a loop or wrapped inside another object, the
>>> optimization gets disabled easily. For example, when we run the following
>>> benchmark in JMH (JDK 11 / Linux):
>>>
>>> public class ArrayLoop {
>>>     private static Random s_r = new Random();
>>>     private static int next() { return s_r.nextInt() % 1000; }
>>>
>>>     private static int loop(int... arr) {
>>>         int sum = 0;
>>>         for (int i = arr.length - 1; i >= 0; sum += arr[i--]) { ; }
>>>         return sum;
>>>     }
>>>
>>>     @Benchmark
>>>     public void loopSum(Blackhole bh) {
>>>         bh.consume(loop(next(), next()));
>>>     }
>>> }
>>>
>>> # JMH version: 1.21
>>> # VM version: JDK 11.0.4, OpenJDK 64-Bit Server VM, 11.0.4+11
>>> ArrayLoop.loopSum                 avgt    3   26.124 ±   7.727  ns/op
>>> ArrayLoop.loopSum:·gc.alloc.rate  avgt    3  700.529 ± 208.524  MB/sec
>>> ArrayLoop.loopSum:·gc.count       avgt    3    5.000            counts
>>>
>>> We see unexpected gc activity. When we avoid the loop by "unrolling" it
>>> and adding the following to the ArrayLoop class above
>>>
>>> // silly manually unrolled loop
>>> private static int unrolled(int... arr) {
>>>     int sum = 0;
>>>     switch (arr.length) {
>>>         default: for (int i = arr.length - 1; i >= 4; sum += arr[i--]) { ; }
>>>         case 4:  sum += arr[3];
>>>         case 3:  sum += arr[2];
>>>         case 2:  sum += arr[1];
>>>         case 1:  sum += arr[0];
>>>     }
>>>     return sum;
>>> }
>>>
>>> @Benchmark
>>> public void unrolledSum(Blackhole bh) {
>>>     bh.consume(unrolled(next(), next()));
>>> }
>>>
>>> #
>>> ArrayLoop.unrolledSum                 avgt    3  25.076 ± 1.711  ns/op
>>> ArrayLoop.unrolledSum:·gc.alloc.rate  avgt    3   ≈ 10⁻⁴         MB/sec
>>> ArrayLoop.unrolledSum:·gc.count       avgt    3      ≈ 0         counts
>>>
>>> scalar replacement kicks in as expected. Then, to try out a more realistic
>>> scenario representing our usage, we added the following wrapper and
>>> benchmarks:
>>>
>>> private static class ArrayWrapper {
>>>     final int[] arr;
>>>     ArrayWrapper(int... many) { arr = many; }
>>>     int loopSum() { return loop(arr); }
>>>     int unrolledSum() { return unrolled(arr); }
>>> }
>>>
>>> @Benchmark
>>> public void wrappedUnrolledSum(Blackhole bh) {
>>>     bh.consume(new ArrayWrapper(next(), next()).unrolledSum());
>>> }
>>>
>>> @Benchmark
>>> public void wrappedLoopSum(Blackhole bh) {
>>>     bh.consume(new ArrayWrapper(next(), next()).loopSum());
>>> }
>>>
>>> #
>>> ArrayLoop.wrappedLoopSum                     avgt    3   26.190 ±  18.853  ns/op
>>> ArrayLoop.wrappedLoopSum:·gc.alloc.rate      avgt    3  699.433 ± 512.953  MB/sec
>>> ArrayLoop.wrappedLoopSum:·gc.count           avgt    3    6.000            counts
>>> ArrayLoop.wrappedUnrolledSum                 avgt    3   25.877 ±  13.348  ns/op
>>> ArrayLoop.wrappedUnrolledSum:·gc.alloc.rate  avgt    3  707.440 ± 360.702  MB/sec
>>> ArrayLoop.wrappedUnrolledSum:·gc.count       avgt    3    6.000            counts
>>>
>>> While the LoopSum behaviour is the same as before here, even the UnrolledSum
>>> benchmark starts to show gc activity. What gives?
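>>>
>>> (In case it helps with diagnosis: one way we can dump the JIT's inlining
>>> decisions for a single benchmark is to fork it with diagnostic flags,
>>> roughly like this -- a sketch with a hypothetical benchmark name, and the
>>> exact output is HotSpot-version dependent:)
>>>
>>>     @Benchmark
>>>     @Fork(value = 1, jvmArgsAppend = {
>>>         "-XX:+UnlockDiagnosticVMOptions",
>>>         "-XX:+PrintInlining"   // shows whether ArrayWrapper.<init>/unrolledSum get inlined
>>>     })
>>>     public void wrappedUnrolledSumDiag(Blackhole bh) {
>>>         bh.consume(new ArrayWrapper(next(), next()).unrolledSum());
>>>     }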
>>>
>>> Thanks,
>>> Govind
>>> PS: MCVE available at https://github.com/gjajoo/EA/
>>>