status of VM long loop optimizations - call for action
Rado Smogura
mail at smogura.eu
Mon Dec 13 08:57:01 UTC 2021
Definitely yes!
I'll check the ASM output and graphs later, to see if there's anything
that looks strange.
I'd really like to see numbers with heap array pinning!
BR,
Rado
P.S. I think I need to finish packaging it and put it into a public repo -
it's a drop-in replacement that cooperates with the current JDK Socket factories.
On 11.12.2021 23:30, Maurizio Cimadamore wrote:
> Thanks Rado,
> seems like we're in the same ballpark? (which is great, since we're
> removing a lot of complexity from the implementation)
>
> (P.S. it's impressive how much faster your implementation is compared
> to JDK sockets, in the 2nd and 3rd benchmarks).
>
> Maurizio
>
> On 11/12/2021 16:38, Rado Smogura wrote:
>> Hi all,
>>
>>
>> Just for comparison, a run against the April commits
>>
>>
>> "Before"
>>
>> Benchmark                          Mode  Cnt        Score        Error  Units
>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>>
>>
>> "Current" - other runs
>> Benchmark                          Mode  Cnt        Score        Error  Units
>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ± 263855.161  ops/s
>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±  68389.408  ops/s
>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ± 297527.130  ops/s
>>
>> I have to point out that this benchmark is not perfect, as it actually
>> reads data from a backend server, so other noise may come into play.
>>
>> BR,
>>
>> Rado
>>
>>> Hi Maurizio,
>>>
>>>
>>> Checked against JExtract branch
>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>
>>> Project:
>>> https://github.com/rsmogura/panama-io
>>>
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>>
>>> Numbers look amazing - I have to check that it still does what it's
>>> intended to do (i.e. write some integration tests).
>>>
>>> Kind regards,
>>>
>>> Rado
>>>
>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>> Hi Ty,
>>>> there is a simple trick to make sure you get the best performance.
>>>>
>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it;
>>>> the returned VarHandle will throw an error at runtime instead of
>>>> trying to convert arguments.
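>>>>
>>>> As a minimal sketch (placeholder names, assuming the jdk.incubator.foreign
>>>> API used elsewhere in this thread), it would look roughly like this:
>>>>
>>>> import java.lang.invoke.VarHandle;
>>>> import jdk.incubator.foreign.MemoryHandles;
>>>> import jdk.incubator.foreign.MemorySegment;
>>>> import jdk.incubator.foreign.ResourceScope;
>>>> import jdk.incubator.foreign.ValueLayout;
>>>>
>>>> class ExactHandleSketch {
>>>>     // withInvokeExactBehavior() makes the handle reject inexact argument
>>>>     // lists (e.g. an int offset) instead of silently converting them.
>>>>     static final VarHandle INT_HANDLE =
>>>>             MemoryHandles.varHandle(ValueLayout.JAVA_INT).withInvokeExactBehavior();
>>>>
>>>>     public static void main(String[] args) {
>>>>         MemorySegment segment = MemorySegment.allocateNative(
>>>>                 ValueLayout.JAVA_INT, ResourceScope.newConfinedScope());
>>>>         INT_HANDLE.set(segment, 0L, 5);    // exact: the offset is a long
>>>>         // INT_HANDLE.set(segment, 0, 5);  // would throw WrongMethodTypeException
>>>>     }
>>>> }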
>>>>
>>>> Rémi
>>>>
>>>> [1]
>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>>>>
>>>> ----- Original Message -----
>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>>> Yeah, I forgot that. Apologies.
>>>>>
>>>>>
>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>> Hi,
>>>>>> I don't think the 1ns difference is real - if you look at the
>>>>>> error in the second run, it's higher than that, so it's in the noise.
>>>>>>
>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>>> the VM can help with is removing bounds checks when you keep accessing
>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>> optimization called "bounds check elimination", or BCE. This
>>>>>> optimization is routinely applied to Java array accesses, but it
>>>>>> used to fail for memory segments because the bound of a memory
>>>>>> segment is stored in a long variable, not an int.
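>>>>>>
>>>>>> As a hypothetical illustration of the kind of loop this helps (the
>>>>>> method below is a placeholder, not code from this thread): a segment
>>>>>> accessed with a long induction variable, where the per-element bounds
>>>>>> check can now be hoisted out of the loop:
>>>>>>
>>>>>> // Sums all ints in a segment; byteSize() is a long, so the loop index
>>>>>> // is a long too - the case that used to defeat BCE.
>>>>>> static long sumInts(MemorySegment segment) {
>>>>>>     long sum = 0;
>>>>>>     for (long offset = 0; offset < segment.byteSize(); offset += 4) {
>>>>>>         sum += segment.get(ValueLayout.JAVA_INT, offset);
>>>>>>     }
>>>>>>     return sum;
>>>>>> }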
>>>>>>
>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>> handle (e.g. you are passing an int offset instead of a long one;
>>>>>> try
>>>>>> to use "0L" instead of "0").
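>>>>>>
>>>>>> For the benchmark below, that would mean something along the lines of:
>>>>>>
>>>>>> // hypothetical fix: pass the offset as a long so the invocation is exact
>>>>>> GENERIC_HANDLE.set(SEGMENT, 0L, 5);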
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>>
>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>> VarHandles(AKA insertCoordinates) seems to get about 1ns
>>>>>>> consistently
>>>>>>> faster, so I guess these changes helped a bit?
>>>>>>>
>>>>>>>
>>>>>>> Before:
>>>>>>>
>>>>>>>
>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>>
>>>>>>>
>>>>>>> After:
>>>>>>>
>>>>>>>
>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>>
>>>>>>>
>>>>>>> Benchmark:
>>>>>>>
>>>>>>>
>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>         MemorySegment.allocateNative(ValueLayout.JAVA_INT, ResourceScope.newSharedScope());
>>>>>>>
>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>         MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>
>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>
>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>> public void genericHandleBenchmark()
>>>>>>> {
>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>> public void specHandleBenchmark()
>>>>>>> {
>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>> public void specFinalHandleBenchmark()
>>>>>>> {
>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> Sort of off-topic but... I don't remember anyone saying previously
>>>>>>> that insertCoordinates would give that big of a difference (or any at
>>>>>>> all!), so it's surprising to me. I was expecting a performance
>>>>>>> decrease due to the handle no longer being static-final. Could javac
>>>>>>> maybe optimize this, so that for any call like:
>>>>>>>
>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>
>>>>>>> an optimized VarHandle equivalent to SPEC_HANDLE is created at
>>>>>>> compile time and inserted there instead?
>>>>>>>
>>>>>>>
>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>> apologize if
>>>>>>>> this results in duplicates).
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> A few days ago some VM enhancements were integrated [1, 2], so it is
>>>>>>>> time to take another look at where we are.
>>>>>>>>
>>>>>>>> I put together a branch which removes all workarounds (both for
>>>>>>>> long
>>>>>>>> loops and for alignment checks):
>>>>>>>>
>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>
>>>>>>>>
>>>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>>>> difference is like - here's a visual report:
>>>>>>>>
>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>>>>>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>>>>>>> has identified a possible fix for it). In some cases (parallel
>>>>>>>> tests) we see quite a big jump forward.
>>>>>>>>
>>>>>>>> I think it's hard to say how these results will translate to the
>>>>>>>> real world - my gut feeling is that the simpler bounds checking
>>>>>>>> logic will almost invariably result in performance improvements with
>>>>>>>> more complex code patterns, despite what synthetic benchmarks might
>>>>>>>> say (the current logic in mainline is fragile, as it has to guard
>>>>>>>> against integer overflow, which in turn sometimes kills BCE
>>>>>>>> optimizations).
>>>>>>>>
>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>
>>>>>>>> If you have a project that works against the Java 18 API, it
>>>>>>>> would be very helpful for us if you could try it on the above branch
>>>>>>>> and report back. This will help us make a more informed decision.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>
>>>>>>>>