status of VM long loop optimizations - call for action
Rado Smogura
mail at smogura.eu
Tue Dec 14 18:56:55 UTC 2021
Hi all,
The CallLeaf is stuff from the garbage collector...
BR,
Rado
On 13.12.2021 23:10, Maurizio Cimadamore wrote:
>
> On 13/12/2021 19:49, Rado Smogura wrote:
>> It's my part.
>>
>>
>> The PosixInputStream.read is just quite large, so it does not get
>> inlined, and even if it were, it uses a Polling Allocator which has
>> synchronized code (so it's hard to unroll in any case).
>>
>>
>> However, I've found something like this in the graph:
>>
>> CallLeaf
>>
>> jvms: Binding$Context::ofAllocator @ bci:0 (line 273)
>> DirectMethodHandle$Holder::invokeStatic @ bci:10
>> 0x00000008010a7400::invoke @
>
> That's odd - I mean, the BindingContext is used when setting up
> downcall method handles, or upcall stubs. But should not be invoked in
> the hot path.
>
> That said, in jdk/jdk calls that need to spill arguments on the stack
> are not intrinsified (but they are on the panama repo). I wonder if
> that is playing a role here? Typically, if you see a lot of
> ProgrammableInvoker stuff showing up in -XX:+PrintInlining, that's the
> culprit.
>
> Maurizio
>
>>
>>
>> I put @ForceInline there, but I see no huge impact.
>>
>>
>> Kind regards,
>>
>> Rado
>>
>>
>> On 13.12.2021 20:27, Maurizio Cimadamore wrote:
>>>
>>> On 13/12/2021 17:50, Rado Smogura wrote:
>>>> Hi,
>>>>
>>>>
>>>> I checked on my side, and actually this code is more complicated;
>>>> there's no direct loop unrolling.
>>>
>>> Which code is more complicated? The one with the new patch which
>>> enabled the VM to do its job?
>>>
>>> If that's the case it would be great if we could boil it down to a
>>> simple-ish reproducer.
>>>
>>> Thanks
>>> Maurizio
>>>
>>>>
>>>>
>>>> Kind regards,
>>>>
>>>> Rado
>>>>
>>>> On 13.12.2021 09:57, Rado Smogura wrote:
>>>>> Definitely yes!
>>>>>
>>>>>
>>>>> I'll check later the ASM output and graphs, to see if there's
>>>>> something which may look strange.
>>>>>
>>>>>
>>>>> I really would like to see numbers with heap arrays pinning!
>>>>>
>>>>>
>>>>> BR,
>>>>>
>>>>> Rado
>>>>>
>>>>>
>>>>> P.S. I think I need to finish packaging it and put it into a
>>>>> public repo - it's a drop-in replacement that cooperates with the
>>>>> current JDK socket factories.
>>>>>
>>>>> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>>>>>> Thanks Rado,
>>>>>> seems like we're in the same ballpark? (which is great, since
>>>>>> we're removing a lot of complexity from the implementation)
>>>>>>
>>>>>> (P.S. it's impressive how much faster your implementation is
>>>>>> compared to JDK sockets, in the 2nd and 3rd bench).
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>> On 11/12/2021 16:38, Rado Smogura wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>>
>>>>>>> Just for comparison, a run against the April commits:
>>>>>>>
>>>>>>>
>>>>>>> "Before"
>>>>>>>
>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±   74877.602  ops/s
>>>>>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±   72637.626  ops/s
>>>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±   38308.317  ops/s
>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ±  106649.696  ops/s
>>>>>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ±  232852.053  ops/s
>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ±  269646.104  ops/s
>>>>>>>
>>>>>>>
>>>>>>> "Current" - other runs
>>>>>>>
>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ±  263855.161  ops/s
>>>>>>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±   68389.408  ops/s
>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ±  297527.130  ops/s
>>>>>>>
>>>>>>> I have to point out that this benchmark is not perfect, as it
>>>>>>> actually reads data from a backing server, so other noise can apply.
>>>>>>>
>>>>>>> BR,
>>>>>>>
>>>>>>> Rado
>>>>>>>
>>>>>>>> Hi Maurizio,
>>>>>>>>
>>>>>>>>
>>>>>>>> Checked against JExtract branch
>>>>>>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>>>>>
>>>>>>>> Project:
>>>>>>>> https://github.com/rsmogura/panama-io
>>>>>>>>
>>>>>>>> Benchmark                          Mode  Cnt        Score         Error  Units
>>>>>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±   74922.610  ops/s
>>>>>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±   33626.860  ops/s
>>>>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±   25456.785  ops/s
>>>>>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ±  548343.499  ops/s
>>>>>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ±  227053.749  ops/s
>>>>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ±  216628.917  ops/s
>>>>>>>>
>>>>>>>> Numbers look amazing - I have to check if it still does what
>>>>>>>> it's intended to do (so I'll write some integration tests).
>>>>>>>>
>>>>>>>> Kind regards,
>>>>>>>>
>>>>>>>> Rado
>>>>>>>>
>>>>>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>>>>>> Hi Ty,
>>>>>>>>> there is a simple trick to be sure to get the best performance.
>>>>>>>>>
>>>>>>>>> When you create the VarHandle, call withInvokeExactBehavior
>>>>>>>>> [1] on it,
>>>>>>>>> the returned VarHandle will throw an error at runtime instead
>>>>>>>>> of trying to convert arguments.
>>>>>>>>>
>>>>>>>>> Rémi
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
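To illustrate Rémi's point, here's a minimal sketch of what withInvokeExactBehavior does. It uses a plain array VarHandle instead of a memory segment handle (the behavior is the same, and the class name here is just illustrative): the default handle silently widens an int argument to long, while the exact-behavior handle rejects it at runtime.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

public class ExactBehaviorDemo {

    // Returns true if the exact-behavior handle rejects an int argument
    // where the handle's type expects a long.
    static boolean throwsOnInexact() {
        long[] arr = new long[1];
        VarHandle vh = MethodHandles.arrayElementVarHandle(long[].class);

        vh.set(arr, 0, 5);  // plain handle: int 5 is silently widened to long

        VarHandle exact = vh.withInvokeExactBehavior();
        exact.set(arr, 0, 5L);  // exact types: fine
        try {
            exact.set(arr, 0, 5);  // int where long is expected
            return false;
        } catch (WrongMethodTypeException e) {
            return true;  // inexact invocation rejected at runtime
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected inexact call: " + throwsOnInexact());
    }
}
```

Silent conversions like the int-to-long widening above are exactly the kind of thing that can hide a performance problem in a benchmark, which is why failing fast is useful here.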
>>>>>>>>>
>>>>>>>>> ----- Original Message -----
>>>>>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>>>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>>>>>> Subject: Re: status of VM long loop optimizations - call for
>>>>>>>>>> action
>>>>>>>>>> Yeah, I forgot that. Apologies.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> I don't think the 1ns difference is real - the error in
>>>>>>>>>>> the second run is higher than that, so it's in the noise.
>>>>>>>>>>>
>>>>>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>>>>>> benchmark should be affected in any way by the VM
>>>>>>>>>>> improvements. What
>>>>>>>>>>> the VM can help with is to remove bound checks when you keep
>>>>>>>>>>> accessing
>>>>>>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>>>>>>> optimization called "bound check elimination" or BCE. This
>>>>>>>>>>> optimization is routinely applied on Java array access, but
>>>>>>>>>>> it used to
>>>>>>>>>>> fail for memory segments because the bound of a memory
>>>>>>>>>>> segment is
>>>>>>>>>>> stored in a long variable, not an int.
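The loop shape Maurizio describes can be sketched as follows. This is a hypothetical stand-in that uses a plain array (a memory segment's byteSize() is a long, while an array's length is an int, hence the simulated long bound); the class and method names are illustrative, not from the patch:

```java
import java.util.Arrays;

public class LongBoundLoop {

    // A counted loop whose bound lives in a long, like a loop over a
    // MemorySegment bounded by byteSize(). This is the shape that used
    // to defeat C2's bounds-check elimination, and that the recent VM
    // enhancements target.
    static long sum(int[] data, long size) {
        long s = 0;
        for (long i = 0; i < size; i++) {
            s += data[(int) i];  // arrays are int-indexed, hence the cast
        }
        return s;
    }

    public static void main(String[] args) {
        int[] data = new int[1024];
        Arrays.fill(data, 2);
        System.out.println(sum(data, data.length));  // 2048
    }
}
```

With an int bound, C2 has long been able to hoist the range check out of such a loop; the point of the VM work is to get the same treatment when the induction variable and bound are longs.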
>>>>>>>>>>>
>>>>>>>>>>> That said, note that you are passing inexact arguments to
>>>>>>>>>>> the var
>>>>>>>>>>> handle (e.g. you are passing an int offset instead of a long
>>>>>>>>>>> one; try
>>>>>>>>>>> to use "0L" instead of "0").
>>>>>>>>>>>
>>>>>>>>>>> Maurizio
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>>>>>> VarHandles(AKA insertCoordinates) seems to get about 1ns
>>>>>>>>>>>> consistently
>>>>>>>>>>>> faster, so I guess these changes helped a bit?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Before:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> After:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>>>>>>     MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>>>>>>         ResourceScope.newSharedScope());
>>>>>>>>>>>>
>>>>>>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>>>>>>     MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>>>>>
>>>>>>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>>>>
>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>> public void genericHandleBenchmark()
>>>>>>>>>>>> {
>>>>>>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>> public void specHandleBenchmark()
>>>>>>>>>>>> {
>>>>>>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> @Benchmark
>>>>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>>>>> public void specFinalHandleBenchmark()
>>>>>>>>>>>> {
>>>>>>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> Sort of off-topic but... I don't remember anyone saying
>>>>>>>>>>>> previously that insertCoordinates would give that big of a
>>>>>>>>>>>> difference (or any at all!), so it's surprising to me. I was
>>>>>>>>>>>> expecting a performance decrease due to the handle no longer
>>>>>>>>>>>> being static-final. Could javac maybe optimize this, so that
>>>>>>>>>>>> in any case where:
>>>>>>>>>>>>
>>>>>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>>>>
>>>>>>>>>>>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is
>>>>>>>>>>>> created at compile time and inserted there instead?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>>>>>>> apologize if
>>>>>>>>>>>>> this results in duplicates).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>> few days ago some VM enhancements were integrated [1, 2],
>>>>>>>>>>>>> so it is
>>>>>>>>>>>>> time to take a look again at where we are.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I put together a branch which removes all workarounds
>>>>>>>>>>>>> (both for long
>>>>>>>>>>>>> loops and for alignment checks):
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also ran memory access benchmarks before/after, to see
>>>>>>>>>>>>> what the
>>>>>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Overall, I think the numbers are solid. The branch w/o
>>>>>>>>>>>>> workarounds
>>>>>>>>>>>>> keeps up with mainline in basically all cases but one
>>>>>>>>>>>>> (UnrolledAccess
>>>>>>>>>>>>> - this code pattern needs more work in the VM, but Roland
>>>>>>>>>>>>> Westrelin
>>>>>>>>>>>>> has identified a possible fix for it). In some cases
>>>>>>>>>>>>> (parallel
>>>>>>>>>>>>> tests) we see quite a big jump forward.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's hard to say how these results will translate
>>>>>>>>>>>>> to the real world - my gut feeling is that the simpler
>>>>>>>>>>>>> bound checking logic will almost invariably result in
>>>>>>>>>>>>> performance improvements with more complex code patterns,
>>>>>>>>>>>>> despite what synthetic benchmarks might say (the current
>>>>>>>>>>>>> logic in mainline is fragile, as it has to guard against
>>>>>>>>>>>>> integer overflow, which in turn sometimes kills BCE
>>>>>>>>>>>>> optimizations).
>>>>>>>>>>>>>
>>>>>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If you have a project that works against the Java 18 API,
>>>>>>>>>>>>> it would be
>>>>>>>>>>>>> very helpful for us if you could try it on the above
>>>>>>>>>>>>> branch and
>>>>>>>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>> Maurizio
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>>>>>
>>>>>>>>>>>>>