status of VM long loop optimizations - call for action
Rado Smogura
mail at smogura.eu
Tue Dec 14 18:56:55 UTC 2021
Hi all,
The CallLeaf is stuff from the garbage collector...
BR,
Rado
On 13.12.2021 23:10, Maurizio Cimadamore wrote:
>
> On 13/12/2021 19:49, Rado Smogura wrote:
>> It's my part.
>>
>>
>> The PosixInputStream.read is just quite large, so it does not get
>> inlined, and even if it were, it uses a Polling Allocator which has
>> synchronized code (so it's hard to unroll in any case).
>>
>>
>> However, I've found something like this in the graph:
>>
>> CallLeaf
>>
>> jvms: Binding$Context::ofAllocator @ bci:0 (line 273)
>> DirectMethodHandle$Holder::invokeStatic @ bci:10
>> 0x00000008010a7400::invoke @
>
> That's odd - I mean, the BindingContext is used when setting up
> downcall method handles, or upcall stubs. But should not be invoked in
> the hot path.
>
> That said, in jdk/jdk calls that need to spill arguments on the stack
> are not intrinsified (but they are on the panama repo). I wonder if
> that is playing a role here? Typically, if you see a lot of
> ProgrammableInvoker stuff showing up in -XX:+PrintInlining, that's the
> culprit.
>
> Maurizio
>
>>
>>
>> I put @ForceInline there, but I see no huge impact.
>>
>>
>> Kind regards,
>>
>> Rado
>>
>>
>> On 13.12.2021 20:27, Maurizio Cimadamore wrote:
>>>
>>> On 13/12/2021 17:50, Rado Smogura wrote:
>>>> Hi,
>>>>
>>>>
>>>> I checked on my side, and actually this code is more complicated;
>>>> there's no direct loop unrolling.
>>>
>>> Which code is more complicated? The one with the new patch which
>>> enabled the VM to do its job?
>>>
>>> If that's the case it would be great if we could boil it down to a
>>> simple-ish reproducer.
>>>
>>> Thanks
>>> Maurizio
>>>
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Rado
>>>>
>>>> On 13.12.2021 09:57, Rado Smogura wrote:
>>>>> Definitely yes!
>>>>>
>>>>>
>>>>> I'll check later the ASM output and graphs, to see if there's
>>>>> something which may look strange.
>>>>>
>>>>>
>>>>> I really would like to see numbers with heap arrays pinning!
>>>>>
>>>>>
>>>>> BR,
>>>>>
>>>>> Rado
>>>>>
>>>>>
>>>>> P.S. I think I need to finish packaging it and put it into a
>>>>> public repo - it's a drop-in replacement that cooperates with the
>>>>> current JDK socket factories.
>>>>>
>>>>> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>>>>>> Thanks Rado,
>>>>>> seems like we're in the same ballpark? (which is great, since
>>>>>> we're removing a lot of complexity from the implementation)
>>>>>>
>>>>>> (P.S. it's impressive how much faster your implementation is
>>>>>> compared to JDK sockets, in the 2nd and 3rd bench).
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> On 11/12/2021 16:38, Rado Smogura wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> Just for comparison, a run against the April commits:
>>>>>>>
>>>>>>>
>>>>>>> "Before"
>>>>>>>
>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±   74877.602  ops/s
>>>>>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±   72637.626  ops/s
>>>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±   38308.317  ops/s
>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ±  106649.696  ops/s
>>>>>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ±  232852.053  ops/s
>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ±  269646.104  ops/s
>>>>>>>
>>>>>>>
>>>>>>> "Current" - other runs
>>>>>>>
>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ±  263855.161  ops/s
>>>>>>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±   68389.408  ops/s
>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ±  297527.130  ops/s
>>>>>>>
>>>>>>> I have to point out that this benchmark is not perfect, as it
>>>>>>> actually reads data from a backing server, so other noise can apply.
>>>>>>>
>>>>>>> BR,
>>>>>>>
>>>>>>> Rado
>>>>>>>
>>>>>>>> Hi Maurizio,
>>>>>>>>
>>>>>>>>
>>>>>>>> Checked against JExtract branch
>>>>>>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>>>>>
>>>>>>>> Project:
>>>>>>>> https://github.com/rsmogura/panama-io
>>>>>>>>
>>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±   74922.610  ops/s
>>>>>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±   33626.860  ops/s
>>>>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±   25456.785  ops/s
>>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ±  548343.499  ops/s
>>>>>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ±  227053.749  ops/s
>>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ±  216628.917  ops/s
>>>>>>>>
>>>>>>>> Numbers look amazing - I have to check if it still does what
>>>>>>>> it's intended to do (so I'll write some integration tests).
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>>
>>>>>>>> Rado
>>>>>>>>
>>>>>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>>>>>> Hi Ty,
>>>>>>>>> there is a simple trick to be sure to get the best performance.
>>>>>>>>>
>>>>>>>>> When you create the VarHandle, call withInvokeExactBehavior
>>>>>>>>> [1] on it,
>>>>>>>>> the returned VarHandle will throw an error at runtime instead
>>>>>>>>> of trying to convert arguments.
>>>>>>>>>
>>>>>>>>> Rémi
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
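To illustrate Rémi's point, here's a minimal sketch of what withInvokeExactBehavior does. It uses a plain array VarHandle instead of a memory segment handle (the behavior is the same, and the class name here is just illustrative): the default handle silently widens an int argument to long, while the exact-behavior handle rejects it at runtime.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

public class ExactBehaviorDemo {

    // Returns true if the exact-behavior handle rejects an int argument
    // where the handle's type expects a long.
    static boolean throwsOnInexact() {
        long[] arr = new long[1];
        VarHandle vh = MethodHandles.arrayElementVarHandle(long[].class);

        vh.set(arr, 0, 5);  // plain handle: int 5 is silently widened to long

        VarHandle exact = vh.withInvokeExactBehavior();
        exact.set(arr, 0, 5L);  // exact types: fine
        try {
            exact.set(arr, 0, 5);  // int where long is expected
            return false;
        } catch (WrongMethodTypeException e) {
            return true;  // inexact invocation rejected at runtime
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected inexact call: " + throwsOnInexact());
    }
}
```

Silent conversions like the int-to-long widening above are exactly the kind of thing that can hide a performance problem in a benchmark, which is why failing fast is useful here.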
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>>>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>>>>>> Subject: Re: status of VM long loop optimizations - call for
>>>>>>>>>> action
>>>>>>>>>> Yeah, I forgot that. Apologies.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> I don't think the 1ns difference is real - the error in
>>>>>>>>>>> the second run is higher than that, so it's in the noise.
>>>>>>>>>>>
>>>>>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>>>>>> benchmark should be affected in any way by the VM
>>>>>>>>>>> improvements. What
>>>>>>>>>>> the VM can help with is to remove bound checks when you keep
>>>>>>>>>>> accessing
>>>>>>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>>>>>>> optimization called "bound check elimination" or BCE. This
>>>>>>>>>>> optimization is routinely applied on Java array access, but
>>>>>>>>>>> it used to
>>>>>>>>>>> fail for memory segments because the bound of a memory
>>>>>>>>>>> segment is
>>>>>>>>>>> stored in a long variable, not an int.
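The loop shape Maurizio describes can be sketched as follows. This is a hypothetical stand-in that uses a plain array (a memory segment's byteSize() is a long, while an array's length is an int, hence the simulated long bound); the class and method names are illustrative, not from the patch:

```java
import java.util.Arrays;

public class LongBoundLoop {

    // A counted loop whose bound lives in a long, like a loop over a
    // MemorySegment bounded by byteSize(). This is the shape that used
    // to defeat C2's bounds-check elimination, and that the recent VM
    // enhancements target.
    static long sum(int[] data, long size) {
        long s = 0;
        for (long i = 0; i < size; i++) {
            s += data[(int) i];  // arrays are int-indexed, hence the cast
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        Arrays.fill(data, 2);
        System.out.println(sum(data, data.length));  // 2048
    }
}
```

With an int bound, C2 has long been able to hoist the range check out of such a loop; the point of the VM work is to get the same treatment when the induction variable and bound are longs.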
>>>>>>>>>>>
>>>>>>>>>>> That said, note that you are passing inexact arguments to
>>>>>>>>>>> the var
>>>>>>>>>>> handle (e.g. you are passing an int offset instead of a long
>>>>>>>>>>> one; try
>>>>>>>>>>> to use "0L" instead of "0").
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>>>>>> VarHandles(AKA insertCoordinates) seems to get about 1ns
>>>>>>>>>>>> consistently
>>>>>>>>>>>> faster, so I guess these changes helped a bit?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Before:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>>>>>>     MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>>>>>>         ResourceScope.newSharedScope());
>>>>>>>>>>>>
>>>>>>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>>>>>>     MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>>>>>
>>>>>>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>> public void genericHandleBenchmark()
>>>>>>>>>>>> {
>>>>>>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>> public void specHandleBenchmark()
>>>>>>>>>>>> {
>>>>>>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>> public void specFinalHandleBenchmark()
>>>>>>>>>>>> {
>>>>>>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Sort of off-topic but... I don't remember anyone saying
>>>>>>>>>>>> previously that insertCoordinates would give that big of a
>>>>>>>>>>>> difference (or any at all!), so it's surprising to me. I was
>>>>>>>>>>>> expecting a performance decrease due to the handle no longer
>>>>>>>>>>>> being static-final. Could javac maybe optimize this, so that
>>>>>>>>>>>> in any case where:
>>>>>>>>>>>>
>>>>>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>>>>
>>>>>>>>>>>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is
>>>>>>>>>>>> created at compile time and inserted there instead?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>>>>>>> apologize if
>>>>>>>>>>>>> this results in duplicates).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> few days ago some VM enhancements were integrated [1, 2],
>>>>>>>>>>>>> so it is
>>>>>>>>>>>>> time to take a look again at where we are.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I put together a branch which removes all workarounds
>>>>>>>>>>>>> (both for long
>>>>>>>>>>>>> loops and for alignment checks):
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also ran memory access benchmarks before/after, to see
>>>>>>>>>>>>> what the
>>>>>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Overall, I think the numbers are solid. The branch w/o
>>>>>>>>>>>>> workarounds
>>>>>>>>>>>>> keeps up with mainline in basically all cases but one
>>>>>>>>>>>>> (UnrolledAccess
>>>>>>>>>>>>> - this code pattern needs more work in the VM, but Roland
>>>>>>>>>>>>> Westrelin
>>>>>>>>>>>>> has identified a possible fix for it). In some cases
>>>>>>>>>>>>> (parallel
>>>>>>>>>>>>> tests) we see quite a big jump forward.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's hard to say how these results will translate
>>>>>>>>>>>>> to the real world - my gut feeling is that the simpler
>>>>>>>>>>>>> bound checking logic will almost invariably result in
>>>>>>>>>>>>> performance improvements with more complex code patterns,
>>>>>>>>>>>>> despite what synthetic benchmarks might say (the current
>>>>>>>>>>>>> logic in mainline is fragile, as it has to guard against
>>>>>>>>>>>>> integer overflow, which in turn sometimes kills BCE
>>>>>>>>>>>>> optimizations).
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you have a project that works against the Java 18 API,
>>>>>>>>>>>>> it would be
>>>>>>>>>>>>> very helpful for us if you could try it on the above
>>>>>>>>>>>>> branch and
>>>>>>>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>>>>>
>>>>>>>>>>>>>