scalar replacement of arrays affected by minor changes to surrounding code

Thu Sep 19 01:23:41 UTC 2019

On Wed, Sep 18, 2019 at 7:24 PM Govind Jajoo <gjajoo+java at gmail.com> wrote:

> Vitaly-
>
> Thanks for the pointers - we'd already considered many of them (except for
> disabling loop unrolling) with the following observations
>
Yeah, I figured you have.  But always worth checking :).

>
> > Graal’s presence.  By the way, you should try your benchmark/tests on it
> and see how it fares.
> When we tried this with graal CE, it did seem less flaky but unfortunately
> the scalar replacement didn't happen on the most idiomatic code i.e.
> ArrayWrapper.loopSum(). The reason we need the ArrayWrapper is to avoid the
> default implementation of array's equals/hashcode :-(. And it's rather
> disappointing that despite using a state of the art optimizer from year
> 2019, one may have to resort to writing code that looks like Duff's device!
> (But I digress). We saw other performance regressions using graal so it's
> not an option for us for the time being.
>
Ok cool.  It’s strange that neither JIT handles this seemingly pedestrian
code shape.  Have you reported the EA issue to Graal’s github?
https://github.com/oracle/graal.  They might be interested in your other
regressions vs C2 as well, but up to you.

>
> > test with a more canonical forward loop
> That's how I wrote the first test and did try all variations -
> forward/backward/for(int i: arr) etc... and none of it makes any
> difference. Another variation I tried was to use Integer array instead to
> no avail.
>
Yeah, ok.

>
> >Another thing worth trying is disabling loop unrolling.
> Per your suggestion I tried disabling it by setting -XX:LoopUnrollLimit=1 and
> it doesn't seem to make any difference with either graal or c2 in this case.
>
Ok.

>
> >  It’s more like a nice bonus optimization: great if it happens, Ok if
> not.
> Yea unfortunately it does seem to be a case of so close! yet so far. I was
> honestly hoping we were missing something and someone on this list would
> suggest an incantation of vm options that would make it work.
>
Yeah, not sure - seems like fairly vanilla code.  I was going to suggest
that JMH is interfering somehow (doubtful but ...), but then it should’ve
equally tripped up the manual unroll via switch.

If you have the interest and time to dig deeper, try a fastdebug build that
someone recommended up top and enable EA tracing output.  It may shed some
light.

>
> Thanks,
> Govind
> On Wed, Sep 18, 2019 at 1:43 PM Vitaly Davidovich <vitalyd at gmail.com>
> wrote:
>
>> EA is pretty flakey, at least in C2.  It’s impressive when it works, but
>> there’re lots of scenarios where it doesn’t, unfortunately.  I’m not even
>> certain how much, if any, effort is directed at improving it given Graal’s
>> presence.  By the way, you should try your benchmark/tests on it and see
>> how it fares.
>>
>> That said, have you tried your test with a more canonical forward loop
>> (ie i = 0; i < arr.length; i++)? It’s a bit silly and might make no
>> difference, particularly since loop structure is usually canonicalized, but
>> stranger things have happened.
>>
>> Another thing worth trying is disabling loop unrolling.  I don’t recall
>> the exact details, but there’s some optimization pass ordering issue
>> between EA and loop unrolling, which may cause missing scalar replacement.
>>
>> But beyond the intellectual curiosity angle behind your question, you
>> really shouldn’t rely on EA to recover performance (or GC pressure) in
>> critical scenarios.  It’s more like a nice bonus optimization: great if it
>> happens, Ok if not.
>>
>> On Mon, Sep 16, 2019 at 6:08 PM Govind Jajoo <gjajoo+java at gmail.com>
>> wrote:
>>
>>> hi Eric,
>>>
>>> We're operating well within the default limit of
>>> -XX:EliminateAllocationArraySizeLimit
>>> and as shown in the tests, escape analysis is able to identify and elide
>>> the array allocations for hand-unrolled loops. What we're trying to
>>> figure
>>> out is why a loop or an object wrapper is affecting this optimization?
>>> We've tried with and without the ... args, but creating a temporary array
>>> instead and it makes no difference (Examples checked in to the github
>>> repo).
>>>
>>> Are you suggesting that this optimization is not supported in presence of
>>> loops?
>>>
>>> Thanks,
>>> Govind
>>>
>>>
>>> On Mon, Sep 16, 2019 at 11:40 PM Eric Caspole <eric.caspole at oracle.com>
>>> wrote:
>>>
>>> > Hi Govind,
>>> > When you use ... to pass parameters and receive the array, the array
>>> > must be created to pass the parameters, so it is expected to get some
>>> > allocation and GCs. You can see it in the bytecode for your loopSum:
>>> >
>>> >    public void loopSum(org.openjdk.jmh.infra.Blackhole);
>>> >      descriptor: (Lorg/openjdk/jmh/infra/Blackhole;)V
>>> >      Code:
>>> >         0: aload_1
>>> >         1: iconst_2
>>> >         2: newarray       int
>>> >         4: dup
>>> >         5: iconst_0
>>> >         6: invokestatic  #6                  // Method next:()I
>>> >         9: iastore
>>> >        10: dup
>>> >        11: iconst_1
>>> >        12: invokestatic  #6                  // Method next:()I
>>> >        15: iastore
>>> >        16: invokestatic  #2                  // Method loop:([I)I
>>> >        19: invokevirtual #7                  // Method
>>> > org/openjdk/jmh/infra/Blackhole.consume:(I)V
>>> >        22: return
>>> >
>>> > If you want to reduce the object allocation maybe you can tweak your
>>> > code to not pass arguments by ...
>>> > Regards,
>>> > Eric
>>> >
>>> >
>>> > On 9/16/19 11:19, Govind Jajoo wrote:
>>> > > Hi team,
>>> > >
>>> > > We're seeing some unexpected behaviour with scalar replacement of
>>> arrays
>>> > > getting affected by subtle changes to surrounding code. If a newly
>>> > created
>>> > > array is accessed in a loop or wrapped inside another object, the
>>> > > optimization gets disabled easily. For example when we run the
>>> following
>>> > > benchmark in jmh (jdk11/linux)
>>> > >
>>> > > public class ArrayLoop {
>>> > >      private static Random s_r = new Random();
>>> > >      private static int next() { return s_r.nextInt() % 1000; }
>>> > >
>>> > >      private static int loop(int... arr) {
>>> > >          int sum = 0;
>>> > >          for (int i = arr.length - 1; i >= 0; sum += arr[i--]) { ; }
>>> > >          return sum;
>>> > >      }
>>> > >
>>> > >      @Benchmark
>>> > >      public void loopSum(Blackhole bh) {
>>> > >          bh.consume(loop(next(), next()));
>>> > >      }
>>> > > }
>>> > >
>>> > > # JMH version: 1.21
>>> > > # VM version: JDK 11.0.4, OpenJDK 64-Bit Server VM, 11.0.4+11
>>> > > ArrayLoop.loopSum                                     avgt    3
>>>  26.124
>>> > ±
>>> > >     7.727   ns/op
>>> > > ArrayLoop.loopSum:·gc.alloc.rate                      avgt    3
>>> 700.529
>>> > ±
>>> > >   208.524  MB/sec
>>> > > ArrayLoop.loopSum:·gc.count                           avgt    3
>>> 5.000
>>> > >            counts
>>> > >
>>> > > We see unexpected gc activity. When we avoid the loop by "unrolling"
>>> it
>>> > and
>>> > > adding the following to the ArrayLoop class above
>>> > >
>>> > >      // silly manually unrolled loop
>>> > >      private static int unrolled(int... arr) {
>>> > >          int sum = 0;
>>> > >          switch (arr.length) {
>>> > >              default: for (int i = arr.length - 1; i >= 4; sum +=
>>> > arr[i--])
>>> > > { ; }
>>> > >              case 4: sum += arr[3];
>>> > >              case 3: sum += arr[2];
>>> > >              case 2: sum += arr[1];
>>> > >              case 1: sum += arr[0];
>>> > >          }
>>> > >          return sum;
>>> > >      }
>>> > >
>>> > >      @Benchmark
>>> > >      public void unrolledSum(Blackhole bh) {
>>> > >          bh.consume(unrolled(next(), next()));
>>> > >      }
>>> > >
>>> > > #
>>> > > ArrayLoop.unrolledSum                                      avgt    3
>>> > > 25.076 ±    1.711   ns/op
>>> > > ArrayLoop.unrolledSum:·gc.alloc.rate                       avgt
>>> 3   ≈
>>> > > 10⁻⁴             MB/sec
>>> > > ArrayLoop.unrolledSum:·gc.count                            avgt    3
>>> >   ≈
>>> > > 0             counts
>>> > >
>>> > > scalar replacement kicks in as expected. Then to try out a more
>>> realistic
>>> > > scenario representing our usage, we added the following wrapper and
>>> > > benchmarks
>>> > >
>>> > >      private static class ArrayWrapper {
>>> > >          final int[] arr;
>>> > >          ArrayWrapper(int... many) { arr = many; }
>>> > >          int loopSum() { return loop(arr); }
>>> > >          int unrolledSum() { return unrolled(arr); }
>>> > >      }
>>> > >
>>> > >      @Benchmark
>>> > >      public void wrappedUnrolledSum(Blackhole bh) {
>>> > >          bh.consume(new ArrayWrapper(next(), next()).unrolledSum());
>>> > >      }
>>> > >
>>> > >      @Benchmark
>>> > >      public void wrappedLoopSum(Blackhole bh) {
>>> > >          bh.consume(new ArrayWrapper(next(), next()).loopSum());
>>> > >      }
>>> > >
>>> > > #
>>> > > ArrayLoop.wrappedLoopSum                                   avgt    3
>>> > > 26.190 ±   18.853   ns/op
>>> > > ArrayLoop.wrappedLoopSum:·gc.alloc.rate                    avgt    3
>>> > >   699.433 ±  512.953  MB/sec
>>> > > ArrayLoop.wrappedLoopSum:·gc.count                         avgt    3
>>> > >   6.000             counts
>>> > > ArrayLoop.wrappedUnrolledSum                               avgt    3
>>> > > 25.877 ±   13.348   ns/op
>>> > > ArrayLoop.wrappedUnrolledSum:·gc.alloc.rate                avgt    3
>>> > >   707.440 ±  360.702  MB/sec
>>> > > ArrayLoop.wrappedUnrolledSum:·gc.count                     avgt    3
>>> > >   6.000             counts
>>> > >
>>> > > While the LoopSum behaviour is same as before here, even the
>>> UnrolledSum
>>> > > benchmark starts to show gc activity. What gives?
>>> > >
>>> > > Thanks,
>>> > > Govind
>>> > > PS: MCVE available at https://github.com/gjajoo/EA/
>>> > >
>>> >
>>>
>> --
>> Sent from my phone
>>
> --
Sent from my phone