status of VM long loop optimizations - call for action

Mon Dec 13 17:50:00 UTC 2021

Hi,

I checked on my side; the code there is actually more complicated and
there's no direct loop unrolling.

Kind regards,

Rado

On 13.12.2021 09:57, Rado Smogura wrote:
> Definitely yes!
>
>
> I'll check later the ASM output and graphs, to see if there's 
> something which may look strange.
>
>
> I really would like to see numbers with heap arrays pinning!
>
>
> BR,
>
> Rado
>
>
> P. S. I think I need to finish packaging it and put it into a public repo
> - it's a drop-in replacement that cooperates with the current JDK Socket factories.
>
> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>> Thanks Rado,
>> seems like we're in the same ballpark? (which is great, since we're 
>> removing a lot of complexity from the implementation)
>>
>> (P.S. it's impressive how much faster your implementation is compared 
>> to JDK sockets, in the 2nd and 3rd bench).
>>
>> Maurizio
>>
>> On 11/12/2021 16:38, Rado Smogura wrote:
>>> Hi all,
>>>
>>>
>>> Just for comparison, here is a run against the April commits
>>>
>>>
>>> "Before"
>>>
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>>>
>>>
>>> "Current" - other runs
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ± 263855.161  ops/s
>>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±  68389.408  ops/s
>>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ± 297527.130  ops/s
>>>
>>> I have to point out that this benchmark is not perfect, as it really 
>>> reads data from the back server, so other noise can apply.
>>>
>>> BR,
>>>
>>> Rado
>>>
>>>> Hi Maurizio,
>>>>
>>>>
>>>> Checked against JExtract branch 
>>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>
>>>> Project: https://github.com/rsmogura/panama-io
>>>>
>>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>>>
>>>> Numbers look amazing - I have to check if it still does what it's
>>>> intended to do (so write some integration test).
>>>>
>>>> Kind regards,
>>>>
>>>> Rado
>>>>
>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>> Hi Ty,
>>>>> there is a simple trick to be sure to get the best performance.
>>>>>
>>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it;
>>>>> the returned VarHandle will throw an error at runtime instead of
>>>>> trying to convert arguments.
>>>>>
>>>>> Rémi
>>>>>
>>>>> [1] 
>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>>>>>
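The effect of withInvokeExactBehavior can be sketched without the foreign memory API. The example below is only an illustration, assuming a plain int[] element handle from java.lang.invoke in place of a memory segment handle; the class and method names are hypothetical. A lenient handle quietly widens a short value to int, while the exact handle rejects the same call site with WrongMethodTypeException.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

// Hypothetical demo class; an int[] handle stands in for a memory segment handle.
class ExactVarHandleDemo {
    // Element access handle for int[]; the type of its "set" mode is (int[], int, int)void.
    static final VarHandle LENIENT = MethodHandles.arrayElementVarHandle(int[].class);

    // Same handle, but call sites whose static types do not match exactly now fail.
    static final VarHandle EXACT = LENIENT.withInvokeExactBehavior();

    // True if the lenient handle accepts a short value and converts it to int.
    static boolean lenientAcceptsShort() {
        int[] data = new int[1];
        LENIENT.set(data, 0, (short) 5);       // call site type (int[], int, short)void is adapted
        return data[0] == 5;
    }

    // True if the exact handle rejects the same inexact call site.
    static boolean exactRejectsShort() {
        int[] data = new int[1];
        try {
            EXACT.set(data, 0, (short) 5);     // call site type does not match exactly
            return false;
        } catch (WrongMethodTypeException expected) {
            return true;
        }
    }

    // True if an exactly typed call goes through the exact handle.
    static boolean exactAcceptsInt() {
        int[] data = new int[1];
        EXACT.set(data, 0, 5);                 // call site type (int[], int, int)void matches
        return data[0] == 5;
    }

    public static void main(String[] args) {
        System.out.println("lenient accepts short: " + lenientAcceptsShort());
        System.out.println("exact rejects short:   " + exactRejectsShort());
        System.out.println("exact accepts int:     " + exactAcceptsInt());
    }
}
```

Turning the silent conversion into an error makes an accidentally inexact call site visible during testing instead of showing up later as a slow path.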
>>>>> ----- Original Message -----
>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>, 
>>>>>> "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>>>> Yeah, I forgot that. Apologies.
>>>>>>
>>>>>>
>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>> Hi,
>>>>>>> I don't think the 1ns difference is real - the error in the second
>>>>>>> run is higher than that, so it's in the noise.
>>>>>>>
>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>>>> the VM can help with is to remove bound checks when you keep accessing
>>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>>> optimization called "bound check elimination" or BCE. This
>>>>>>> optimization is routinely applied on Java array access, but it used to
>>>>>>> fail for memory segments because the bound of a memory segment is
>>>>>>> stored in a long variable, not an int.
>>>>>>>
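The code shape that bound check elimination targets can be sketched as follows. This is a minimal illustration, not the memory segment implementation: LongBoundedRegion is a hypothetical class that keeps its bound in a long and checks every access with Objects.checkIndex(long, long), which exists since Java 16. The sketch measures nothing, and whether C2 actually hoists the check out of the loop depends on the JDK build.

```java
import java.nio.ByteBuffer;
import java.util.Objects;

// Hypothetical stand-in for a region of memory whose bound is a long, not an int.
class LongBoundedRegion {
    private final ByteBuffer storage;
    private final long length;   // bound stored as a long

    LongBoundedRegion(int size) {
        this.storage = ByteBuffer.allocate(size);
        this.length = size;
    }

    long length() {
        return length;
    }

    byte get(long offset) {
        Objects.checkIndex(offset, length);   // per-access bound check against a long bound
        return storage.get((int) offset);
    }

    void put(long offset, byte value) {
        Objects.checkIndex(offset, length);
        storage.put((int) offset, value);
    }

    // Fills a region with 0..count-1 and sums it back in a long-indexed loop.
    // Keep count at or below 127 so the byte values do not wrap.
    static long sumOfFirst(int count) {
        LongBoundedRegion region = new LongBoundedRegion(count);
        for (long i = 0; i < region.length(); i++) {
            region.put(i, (byte) i);
        }
        long total = 0;
        // i stays in [0, length) by construction, so the check inside get()
        // is redundant here; this is the kind of check BCE aims to remove.
        for (long i = 0; i < region.length(); i++) {
            total += region.get(i);
        }
        return total;
    }

    // True if an access just past the end is rejected.
    static boolean rejectsOutOfBounds() {
        try {
            new LongBoundedRegion(4).get(4L);
            return false;
        } catch (IndexOutOfBoundsException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(sumOfFirst(10));
        System.out.println(rejectsOutOfBounds());
    }
}
```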
>>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>>> handle (e.g. you are passing an int offset instead of a long one; try
>>>>>>> to use "0L" instead of "0").
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>>
>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>> VarHandles (AKA insertCoordinates) seems to be consistently about 1ns
>>>>>>>> faster, so I guess these changes helped a bit?
>>>>>>>>
>>>>>>>>
>>>>>>>> Before:
>>>>>>>>
>>>>>>>>
>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>>>
>>>>>>>>
>>>>>>>> After:
>>>>>>>>
>>>>>>>>
>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>>>
>>>>>>>>
>>>>>>>> Benchmark:
>>>>>>>>
>>>>>>>>
>>>>>>>>      public static final MemorySegment SEGMENT =
>>>>>>>> MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>> ResourceScope.newSharedScope());
>>>>>>>>
>>>>>>>>      public static final VarHandle GENERIC_HANDLE =
>>>>>>>> MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>
>>>>>>>>      public static VarHandle SPEC_HANDLE =
>>>>>>>> MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>
>>>>>>>>      public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>> MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>
>>>>>>>>      @Benchmark
>>>>>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>      public void genericHandleBenchmark()
>>>>>>>>      {
>>>>>>>>          GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      @Benchmark
>>>>>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>      public void specHandleBenchmark()
>>>>>>>>      {
>>>>>>>>          SPEC_HANDLE.set(5);
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      @Benchmark
>>>>>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>      public void specFinalHandleBenchmark()
>>>>>>>>      {
>>>>>>>>          SPEC_HANDLE_FINAL.set(5);
>>>>>>>>      }
>>>>>>>>
>>>>>>>>
>>>>>>>> Sort of off-topic but... I don't remember anyone saying previously
>>>>>>>> that insertCoordinates would make that big of a difference (or any at
>>>>>>>> all!), so it's surprising to me. I was expecting a performance
>>>>>>>> decrease due to the handle no longer being static-final. Can javac
>>>>>>>> maybe optimize this so that any case where:
>>>>>>>>
>>>>>>>>
>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>
>>>>>>>>
>>>>>>>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created
>>>>>>>> at compile time and inserted there instead?
>>>>>>>>
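The idea of binding coordinates ahead of time can be sketched with the stable java.lang.invoke API. This is only an analogue, not the MemoryHandles.insertCoordinates call from the benchmark: it takes the "set" method handle of an int[] element VarHandle and pre-binds the array and index with MethodHandles.insertArguments. The class name and the TARGET array are hypothetical. The handle sits in a static final field, which is the arrangement under which the JIT can treat it as a constant.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical demo class: pre-binding the "where" of a write, leaving only the value.
class BoundSetterDemo {
    static final int[] TARGET = new int[1];

    static final VarHandle GENERIC = MethodHandles.arrayElementVarHandle(int[].class);

    // GENERIC's set mode has type (int[], int, int)void; binding the array and
    // index 0 leaves a handle of type (int)void.
    static final MethodHandle BOUND_SET = MethodHandles.insertArguments(
            GENERIC.toMethodHandle(VarHandle.AccessMode.SET), 0, TARGET, 0);

    // Writes through the bound handle and reads the slot back directly.
    static int writeAndRead(int value) {
        try {
            BOUND_SET.invokeExact(value);   // call site type (int)void matches exactly
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
        return TARGET[0];
    }

    public static void main(String[] args) {
        System.out.println(writeAndRead(5));
    }
}
```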
>>>>>>>>
>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>> (resending since mailing lists were down yesterday - I 
>>>>>>>>> apologize if
>>>>>>>>> this results in duplicates).
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> a few days ago some VM enhancements were integrated [1, 2], so it is
>>>>>>>>> time to take a look again at where we are.
>>>>>>>>>
>>>>>>>>> I put together a branch which removes all workarounds (both for long
>>>>>>>>> loops and for alignment checks):
>>>>>>>>>
>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>
>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>>>>>>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>>>>>>>> has identified a possible fix for it). In some cases (parallel
>>>>>>>>> tests) we see quite a big jump forward.
>>>>>>>>>
>>>>>>>>> I think it's hard to say how these results will translate to the real
>>>>>>>>> world - my gut feeling is that the simpler bound checking logic will
>>>>>>>>> almost invariably result in performance improvements with more
>>>>>>>>> complex code patterns, despite what synthetic benchmarks might say
>>>>>>>>> (the current logic in mainline is fragile as it has to guard against
>>>>>>>>> integer overflow, which in turn sometimes kills BCE optimizations).
>>>>>>>>>
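Why int-based bound arithmetic needs an overflow guard can be shown with a small example. The sketch below is not the mainline logic; the class and method names are hypothetical, and it only contrasts a naive int check, where offset + size can wrap around, with a check arranged so that no intermediate value can overflow.

```java
// Hypothetical demo class contrasting two bound checks.
class OverflowGuardDemo {
    // Naive int check: offset + size can wrap to a negative number and
    // wrongly pass the comparison against limit.
    static boolean naiveInBounds(int offset, int size, int limit) {
        return offset >= 0 && size >= 0 && offset + size <= limit;
    }

    // Check done on long values and arranged as a subtraction, so no
    // intermediate result can overflow for non-negative inputs.
    static boolean longInBounds(long offset, long size, long limit) {
        return offset >= 0 && size >= 0 && size <= limit && offset <= limit - size;
    }

    public static void main(String[] args) {
        int offset = Integer.MAX_VALUE - 2;   // far outside a 1024-byte region
        int size = 4;
        int limit = 1024;
        System.out.println("naive: " + naiveInBounds(offset, size, limit));
        System.out.println("long:  " + longInBounds(offset, size, limit));
    }
}
```

With these inputs the naive check reports the access as in bounds because the sum wraps, while the long version rejects it. A guard against that wrap-around is extra logic on the access path.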
>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>
>>>>>>>>> If you have a project that works against the Java 18 API, it would be
>>>>>>>>> very helpful for us if you could try it on the above branch and
>>>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>
>>>>>>>>>