status of VM long loop optimizations - call for action

Mon Dec 13 22:10:55 UTC 2021

On 13/12/2021 19:49, Rado Smogura wrote:
> That's my part.
>
>
> PosixInputStream.read is quite large, so it does not get inlined, and
> even if it did, it uses the Polling Allocator, which has synchronized
> code (so it is hard to unroll either way).
>
>
> However, I've found something like this in the graph:
>
> CallLeaf
>
> jvms: Binding$Context::ofAllocator @ bci:0 (line 273) 
> DirectMethodHandle$Holder::invokeStatic @ bci:10 
> 0x00000008010a7400::invoke @

That's odd - I mean, the BindingContext is used when setting up downcall 
method handles, or upcall stubs. But it should not be invoked in the hot path.

That said, in jdk/jdk, calls that need to spill arguments on the stack 
are not intrinsified (but they are in the panama repo). I wonder if that 
is playing a role here? Typically, if you see a lot of 
ProgrammableInvoker stuff showing up in -XX:+PrintInlining, that's the 
culprit.

Maurizio

>
>
> I put @ForceInline there, but for me there's no huge impact.
>
>
> Kind regards,
>
> Rado
>
>
> On 13.12.2021 20:27, Maurizio Cimadamore wrote:
>>
>> On 13/12/2021 17:50, Rado Smogura wrote:
>>> Hi,
>>>
>>>
>>> Checked from my side, and actually on my side this code is more 
>>> complicated and there's no direct loop unrolling.
>>
>> Which code is more complicated? The one with the new patch which 
>> enabled the VM to do its job?
>>
>> If that's the case it would be great if we could boil it down to a 
>> simple-ish reproducer.
>>
>> Thanks
>> Maurizio
>>
>>>
>>>
>>> Kind regards,
>>>
>>> Rado
>>>
>>> On 13.12.2021 09:57, Rado Smogura wrote:
>>>> Definitely yes!
>>>>
>>>>
>>>> I'll check later the ASM output and graphs, to see if there's 
>>>> something which may look strange.
>>>>
>>>>
>>>> I really would like to see numbers with heap arrays pinning!
>>>>
>>>>
>>>> BR,
>>>>
>>>> Rado
>>>>
>>>>
>>>> P. S. I think I need to finish packaging it and put it into a public 
>>>> repo - it's a drop-in replacement cooperating with the current JDK 
>>>> Socket factories.
>>>>
>>>> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>>>>> Thanks Rado,
>>>>> seems like we're in the same ballpark? (which is great, since 
>>>>> we're removing a lot of complexity from the implementation)
>>>>>
>>>>> (P.S. it's impressive how much faster your implementation is 
>>>>> compared to JDK sockets, in the 2nd and 3rd bench).
>>>>>
>>>>> Maurizio
>>>>>
>>>>> On 11/12/2021 16:38, Rado Smogura wrote:
>>>>>> Hi all,
>>>>>>
>>>>>>
>>>>>> Just for comparison, I ran it against the April commits.
>>>>>>
>>>>>>
>>>>>> "Before"
>>>>>>
>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±   74877.602  ops/s
>>>>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±   72637.626  ops/s
>>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±   38308.317  ops/s
>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ±  106649.696  ops/s
>>>>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ±  232852.053  ops/s
>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ±  269646.104  ops/s
>>>>>>
>>>>>>
>>>>>> "Current" - other runs
>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ±  263855.161  ops/s
>>>>>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±   68389.408  ops/s
>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ±  297527.130  ops/s
>>>>>>
>>>>>> I have to point out that this benchmark is not perfect, as it 
>>>>>> really reads data from a back-end server, so other noise may apply.
>>>>>>
>>>>>> BR,
>>>>>>
>>>>>> Rado
>>>>>>
>>>>>>> Hi Maurizio,
>>>>>>>
>>>>>>>
>>>>>>> Checked against JExtract branch 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>>>>
>>>>>>> Project: https://github.com/rsmogura/panama-io
>>>>>>>
>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±   74922.610  ops/s
>>>>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±   33626.860  ops/s
>>>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±   25456.785  ops/s
>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ±  548343.499  ops/s
>>>>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ±  227053.749  ops/s
>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ±  216628.917  ops/s
>>>>>>>
>>>>>>> Numbers look amazing - I have to check if it still does what 
>>>>>>> it's intended to do (so write some integration test).
>>>>>>>
>>>>>>> Kind regards,
>>>>>>>
>>>>>>> Rado
>>>>>>>
>>>>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>>>>> Hi Ty,
>>>>>>>> there is a simple trick to be sure to get the best performance.
>>>>>>>>
>>>>>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it;
>>>>>>>> the returned VarHandle will throw an error at runtime instead of
>>>>>>>> trying to convert arguments.
>>>>>>>>
>>>>>>>> Rémi
>>>>>>>>
>>>>>>>> [1] 
>>>>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>>>>>>>>
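A minimal sketch of the withInvokeExactBehavior suggestion above. As an
assumption made purely for illustration, it uses a plain long[] element
VarHandle from java.lang.invoke (Java 16 or later) rather than a memory
segment handle; the class and method names are hypothetical. It shows an
int value being silently widened by the default handle and rejected by
the invoke-exact one.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

// Hypothetical demo class; not taken from the thread.
class ExactVarHandleDemo {
    // Element accessor for long[]; its "set" type is (long[], int, long)void.
    static final VarHandle LENIENT =
        MethodHandles.arrayElementVarHandle(long[].class);

    // Same accessor, but call sites must match the access mode type exactly.
    static final VarHandle EXACT = LENIENT.withInvokeExactBehavior();

    // The int literal 5 is converted to long behind the scenes.
    static void lenientSet(long[] a) {
        LENIENT.set(a, 0, 5);
    }

    // Returns true if the exact handle refuses an int value for a long slot.
    static boolean exactRejectsIntValue(long[] a) {
        try {
            EXACT.set(a, 0, 5); // int value where a long is required
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }

    // Argument types match exactly, so no conversion is needed.
    static void exactSet(long[] a) {
        EXACT.set(a, 0, 5L);
    }

    public static void main(String[] args) {
        long[] a = new long[1];
        lenientSet(a);
        System.out.println("lenient set stored: " + a[0]);
        System.out.println("exact handle rejected int: " + exactRejectsIntValue(a));
        exactSet(a);
        System.out.println("exact set stored: " + a[0]);
    }
}
```

Making the handle exact turns an accidental int-for-long argument into an
immediate failure, so the mismatch is caught in testing instead of
quietly taking a conversion path.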
>>>>>>>> ----- Original Message -----
>>>>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>, 
>>>>>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>>>>> Subject: Re: status of VM long loop optimizations - call for 
>>>>>>>>> action
>>>>>>>>> Yeah, I forgot that. Apologies.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> I don't think the 1ns difference is real - the error in the second
>>>>>>>>>> run is higher than that, so it's in the noise.
>>>>>>>>>>
>>>>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>>>>>>> the VM can help with is removing bounds checks when you keep
>>>>>>>>>> accessing a segment in a loop, as C2 is now able to correctly apply
>>>>>>>>>> an optimization called "bounds check elimination", or BCE. This
>>>>>>>>>> optimization is routinely applied to Java array accesses, but it
>>>>>>>>>> used to fail for memory segments because the bound of a memory
>>>>>>>>>> segment is stored in a long variable, not an int.
>>>>>>>>>>
>>>>>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>>>>>> handle (e.g. you are passing an int offset instead of a long one; try
>>>>>>>>>> to use "0L" instead of "0").
>>>>>>>>>>
>>>>>>>>>> Maurizio
>>>>>>>>>>
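To make the code shape discussed above concrete, here is a small sketch
of a loop whose induction variable and bound are both long, with an
explicit bounds check on every access. LongBoundedBytes is a
hypothetical stand-in for a memory segment, not the real
implementation, and Objects.checkIndex(long, long) requires Java 16 or
later. The sketch only shows the pattern; whether C2 actually hoists
or removes the check depends on the JDK build in use.

```java
import java.util.Objects;

// Hypothetical demo class; not taken from the thread.
class LongBoundsLoopDemo {
    // Stand-in for a segment: the length is tracked as a long, not an int.
    static final class LongBoundedBytes {
        private final byte[] storage;
        private final long length;

        LongBoundedBytes(int size) {
            this.storage = new byte[size];
            this.length = size;
        }

        long length() {
            return length;
        }

        byte get(long index) {
            Objects.checkIndex(index, length); // long-based bounds check
            return storage[(int) index];
        }

        void set(long index, byte value) {
            Objects.checkIndex(index, length); // long-based bounds check
            storage[(int) index] = value;
        }
    }

    // A "long loop": every iteration performs the bounds check above unless
    // the JIT can prove that i always stays within [0, length).
    static long sum(LongBoundedBytes bytes) {
        long total = 0;
        for (long i = 0; i < bytes.length(); i++) {
            total += bytes.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        LongBoundedBytes bytes = new LongBoundedBytes(10);
        for (long i = 0; i < bytes.length(); i++) {
            bytes.set(i, (byte) (i + 1));
        }
        System.out.println("sum = " + sum(bytes));
    }
}
```

With an int index over a Java array the same loop is the classic
candidate for bounds check elimination; the long variant is the case
the VM changes referenced in this thread are meant to cover.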
>>>>>>>>>>
>>>>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>>>>> VarHandles (AKA insertCoordinates) seems to get about 1ns
>>>>>>>>>>> consistently faster, so I guess these changes helped a bit?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Before:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Benchmark                                    Mode  Cnt   Score    Error  Units
>>>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ±  0.145  ns/op
>>>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ±  0.201  ns/op
>>>>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ±  1.324  ns/op
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> After:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Benchmark                                    Mode  Cnt   Score    Error  Units
>>>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ±  1.466  ns/op
>>>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ±  0.156  ns/op
>>>>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ±  1.712  ns/op
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Benchmark:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>      public static final MemorySegment SEGMENT =
>>>>>>>>>>>          MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>>>>>              ResourceScope.newSharedScope());
>>>>>>>>>>>
>>>>>>>>>>>      public static final VarHandle GENERIC_HANDLE =
>>>>>>>>>>>          MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>>>>
>>>>>>>>>>>      public static VarHandle SPEC_HANDLE =
>>>>>>>>>>>          MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>>>
>>>>>>>>>>>      public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>>>>>          MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>>>
>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>      public void genericHandleBenchmark()
>>>>>>>>>>>      {
>>>>>>>>>>>          GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>      public void specHandleBenchmark()
>>>>>>>>>>>      {
>>>>>>>>>>>          SPEC_HANDLE.set(5);
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>>      @Benchmark
>>>>>>>>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>      public void specFinalHandleBenchmark()
>>>>>>>>>>>      {
>>>>>>>>>>>          SPEC_HANDLE_FINAL.set(5);
>>>>>>>>>>>      }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Sort of off-topic but... I don't remember anyone saying previously
>>>>>>>>>>> that insertCoordinates would give that big of a difference (or any
>>>>>>>>>>> at all!) so it's surprising to me. I was expecting a performance
>>>>>>>>>>> decrease due to the handle no longer being static-final. Can javac
>>>>>>>>>>> maybe optimize this so that any case where:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> is, an optimized VarHandle is created at compile time that is
>>>>>>>>>>> equivalent to SPEC_HANDLE and inserted there instead?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>>>>> (resending since mailing lists were down yesterday - I apologize if
>>>>>>>>>>>> this results in duplicates).
>>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> a few days ago some VM enhancements were integrated [1, 2], so it is
>>>>>>>>>>>> time to take another look at where we are.
>>>>>>>>>>>>
>>>>>>>>>>>> I put together a branch which removes all workarounds (both for long
>>>>>>>>>>>> loops and for alignment checks):
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>>>>
>>>>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>>>>>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>>>>>>>>>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>>>>>>>>>>> has identified a possible fix for it). In some cases (parallel
>>>>>>>>>>>> tests) we see quite a big jump forward.
>>>>>>>>>>>>
>>>>>>>>>>>> I think it's hard to say how these results will translate to the real
>>>>>>>>>>>> world - my gut feeling is that the simpler bounds checking logic will
>>>>>>>>>>>> almost invariably result in performance improvements with more
>>>>>>>>>>>> complex code patterns, despite what synthetic benchmarks might say
>>>>>>>>>>>> (the current logic in mainline is fragile as it has to guard against
>>>>>>>>>>>> integer overflow, which in turn sometimes kills BCE optimizations).
>>>>>>>>>>>>
>>>>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>>>>
>>>>>>>>>>>> If you have a project that works against the Java 18 API, it would be
>>>>>>>>>>>> very helpful for us if you could try it on the above branch and
>>>>>>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>
>>>>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>>>>
>>>>>>>>>>>>