status of VM long loop optimizations - call for action
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Dec 13 19:27:13 UTC 2021
On 13/12/2021 17:50, Rado Smogura wrote:
> Hi,
>
>
> I checked on my side, and this code is actually more complicated; there's
> no direct loop unrolling.
Which code is more complicated? The one built with the new patch, which
enables the VM to do its job?
If that's the case, it would be great if we could boil it down to a
simple-ish reproducer.
Thanks
Maurizio
>
>
> Kind regards,
>
> Rado
>
> On 13.12.2021 09:57, Rado Smogura wrote:
>> Definitely yes!
>>
>>
>> I'll check the ASM output and graphs later, to see if there's anything
>> that looks strange.
>>
>>
>> I would really like to see numbers with heap array pinning!
>>
>>
>> BR,
>>
>> Rado
>>
>>
>> P.S. I think I need to finish packaging it and put it into a public repo
>> - it's a drop-in replacement that cooperates with the current JDK Socket
>> factories.
>>
>> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>>> Thanks Rado,
>>> seems like we're in the same ballpark? (which is great, since we're
>>> removing a lot of complexity from the implementation)
>>>
>>> (P.S. it's impressive how much faster your implementation is compared
>>> to JDK sockets in the 2nd and 3rd benchmarks.)
>>>
>>> Maurizio
>>>
>>> On 11/12/2021 16:38, Rado Smogura wrote:
>>>> Hi all,
>>>>
>>>>
>>>> Just for comparison, a run against the April commits:
>>>>
>>>>
>>>> "Before"
>>>>
>>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
>>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
>>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
>>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>>>>
>>>>
>>>> "Current" - other runs
>>>> Benchmark Mode Cnt Score Error Units
>>>> Benchmark Mode Cnt Score Error Units
>>>> SocketReadPosix.teatRead4k thrpt 5 1163288.078 ±
>>>> 263855.161 ops/s
>>>> SocketReadPosix.testRead16b thrpt 5 3118810.213 ±
>>>> 68389.408 ops/s
>>>> SocketReadPosix.testRead8bOffset thrpt 5 2696627.066 ±
>>>> 297527.130 ops/s
>>>>
>>>> I have to point out that this benchmark is not perfect, as it actually
>>>> reads data from a backing server, so other sources of noise can come
>>>> into play.
>>>>
>>>> BR,
>>>>
>>>> Rado
>>>>
>>>>> Hi Maurizio,
>>>>>
>>>>>
>>>>> Checked against JExtract branch
>>>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>>
>>>>> Project:
>>>>> https://github.com/rsmogura/panama-io
>>>>>
>>>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>>>>
>>>>> The numbers look amazing - I have to check if it still does what it's
>>>>> intended to do (so I'll write some integration tests).
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Rado
>>>>>
>>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>>> Hi Ty,
>>>>>> there is a simple trick to make sure you get the best performance.
>>>>>>
>>>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it;
>>>>>> the returned VarHandle will throw an error at runtime instead of trying
>>>>>> to convert arguments.
>>>>>>
>>>>>> Rémi
>>>>>>
>>>>>> [1]
>>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
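>>>>>>
>>>>>> A minimal sketch of that (illustrative only, assuming the same
>>>>>> VarHandle/memory access API used elsewhere in this thread):
>>>>>>
>>>>>> VarHandle vh = MemoryHandles.varHandle(ValueLayout.JAVA_INT)
>>>>>>                              .withInvokeExactBehavior();
>>>>>> vh.set(segment, 0L, 5);  // ok: coordinate types match exactly
>>>>>> vh.set(segment, 0, 5);   // throws WrongMethodTypeException (int offset, long expected)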
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>>>>> Yeah, I forgot that. Apologies.
>>>>>>>
>>>>>>>
>>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>>> Hi,
>>>>>>>> I don't think the 1ns difference is real - if you look at the error
>>>>>>>> in the second run, it is higher than that, so the difference is in
>>>>>>>> the noise.
>>>>>>>>
>>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>>>>> the VM can help with is removing bounds checks when you keep accessing
>>>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>>>> optimization called "bounds check elimination", or BCE (see the sketch
>>>>>>>> below). This optimization is routinely applied to Java array accesses,
>>>>>>>> but it used to fail for memory segments because the bound of a memory
>>>>>>>> segment is stored in a long variable, not an int.
>>>>>>>>
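>>>>>>>> A minimal sketch of the loop shape this helps with (illustrative
>>>>>>>> only; the handle and segment names here are made up, using the same
>>>>>>>> memory access API as the benchmark quoted below):
>>>>>>>>
>>>>>>>> VarHandle INT_HANDLE = MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>> long sum = 0;
>>>>>>>> for (long offset = 0; offset < segment.byteSize(); offset += 4) {
>>>>>>>>     // the long-typed bound used to defeat BCE; with the VM fix the
>>>>>>>>     // per-access bounds check can be hoisted out of the loop
>>>>>>>>     sum += (int) INT_HANDLE.get(segment, offset);
>>>>>>>> }
>>>>>>>>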
>>>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>>>> handle (e.g. you are passing an int offset instead of a long one; try
>>>>>>>> using "0L" instead of "0").
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>>> VarHandles (AKA insertCoordinates) seems to consistently get about
>>>>>>>>> 1ns faster, so I guess these changes helped a bit?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Before:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Benchmark                                     Mode  Cnt   Score   Error  Units
>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark     avgt    5  21.155 ± 0.145  ns/op
>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark   avgt    5   0.678 ± 0.201  ns/op
>>>>>>>>> VarHandleBenchmark.specHandleBenchmark        avgt    5  17.323 ± 1.324  ns/op
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Benchmark                                     Mode  Cnt   Score   Error  Units
>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark     avgt    5  20.304 ± 1.466  ns/op
>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark   avgt    5   0.652 ± 0.156  ns/op
>>>>>>>>> VarHandleBenchmark.specHandleBenchmark        avgt    5  17.266 ± 1.712  ns/op
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Benchmark:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>>>         MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>>>                 ResourceScope.newSharedScope());
>>>>>>>>>
>>>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>>>         MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>>
>>>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>
>>>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>
>>>>>>>>> @Benchmark
>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>> public void genericHandleBenchmark()
>>>>>>>>> {
>>>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> @Benchmark
>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>> public void specHandleBenchmark()
>>>>>>>>> {
>>>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> @Benchmark
>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>> public void specFinalHandleBenchmark()
>>>>>>>>> {
>>>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sort of off-topic, but... I don't remember anyone previously saying
>>>>>>>>> that insertCoordinates would give that big of a difference (or any
>>>>>>>>> at all!), so it's surprising to me. I was expecting a performance
>>>>>>>>> decrease due to the handle no longer being static final. Could javac
>>>>>>>>> maybe optimize this, so that in any case where:
>>>>>>>>>
>>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>
>>>>>>>>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created
>>>>>>>>> at compile time and inserted there instead?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>>>> apologize if
>>>>>>>>>> this results in duplicates).
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> a few days ago some VM enhancements were integrated [1, 2], so it
>>>>>>>>>> is time to take another look at where we are.
>>>>>>>>>>
>>>>>>>>>> I put together a branch which removes all workarounds (both
>>>>>>>>>> for long
>>>>>>>>>> loops and for alignment checks):
>>>>>>>>>>
>>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I also ran memory access benchmarks before/after, to see what
>>>>>>>>>> the
>>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>>
>>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>>>>> keeps up with mainline in basically all cases but one
>>>>>>>>>> (UnrolledAccess - this code pattern needs more work in the VM, but
>>>>>>>>>> Roland Westrelin has identified a possible fix for it). In some
>>>>>>>>>> cases (the parallel tests) we see quite a big jump forward.
>>>>>>>>>>
>>>>>>>>>> I think it's hard to say how these results will translate to the
>>>>>>>>>> real world - my gut feeling is that the simpler bounds-checking
>>>>>>>>>> logic will almost invariably result in performance improvements
>>>>>>>>>> with more complex code patterns, despite what synthetic benchmarks
>>>>>>>>>> might say (the current logic in mainline is fragile, as it has to
>>>>>>>>>> guard against integer overflow, which in turn sometimes kills BCE
>>>>>>>>>> optimizations; see the sketch below).
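>>>>>>>>>>
>>>>>>>>>> A generic illustration of that fragility (just the shape of the
>>>>>>>>>> problem, not the actual JDK code):
>>>>>>>>>>
>>>>>>>>>> // naive check: offset + length can overflow a long
>>>>>>>>>> if (offset + length > size) throw new IndexOutOfBoundsException();
>>>>>>>>>>
>>>>>>>>>> // overflow-safe check: correct, but the extra comparisons make it
>>>>>>>>>> // harder for C2 to prove accesses in-bounds, which can defeat BCE
>>>>>>>>>> if (offset < 0 || length < 0 || length > size - offset)
>>>>>>>>>>     throw new IndexOutOfBoundsException();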
>>>>>>>>>>
>>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>>
>>>>>>>>>> If you have a project that works against the Java 18 API, it would
>>>>>>>>>> be very helpful for us if you could try it on the above branch and
>>>>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>>
>>>>>>>>>>