status of VM long loop optimizations - call for action
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Mon Dec 13 19:27:13 UTC 2021
On 13/12/2021 17:50, Rado Smogura wrote:
> Hi,
>
>
> I checked on my side, and this code is actually more complicated; there's
> no direct loop unrolling.
Which code is more complicated? The one built with the new patch, which
enables the VM to do its job?
If that's the case, it would be great if we could boil it down to a
simple-ish reproducer.
Thanks
Maurizio
>
>
> Kind regards,
>
> Rado
>
> On 13.12.2021 09:57, Rado Smogura wrote:
>> Definitely yes!
>>
>>
>> I'll check the ASM output and graphs later, to see if there's anything
>> that looks strange.
>>
>>
>> I would really like to see numbers with heap array pinning!
>>
>>
>> BR,
>>
>> Rado
>>
>>
>> P.S. I think I need to finish packaging it and put it into a public repo
>> - it's a drop-in replacement that cooperates with the current JDK Socket
>> factories.
>>
>> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>>> Thanks Rado,
>>> seems like we're in the same ballpark? (which is great, since we're
>>> removing a lot of complexity from the implementation)
>>>
>>> (P.S. it's impressive how much faster your implementation is compared
>>> to JDK sockets in the 2nd and 3rd benchmarks.)
>>>
>>> Maurizio
>>>
>>> On 11/12/2021 16:38, Rado Smogura wrote:
>>>> Hi all,
>>>>
>>>>
>>>> Just for comparison, a run against the April commits:
>>>>
>>>>
>>>> "Before"
>>>>
>>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
>>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
>>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
>>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>>>>
>>>>
>>>> "Current" - other runs
>>>> Benchmark Mode Cnt Score Error Units
>>>> Benchmark Mode Cnt Score Error Units
>>>> SocketReadPosix.teatRead4k thrpt 5 1163288.078 ±
>>>> 263855.161 ops/s
>>>> SocketReadPosix.testRead16b thrpt 5 3118810.213 ±
>>>> 68389.408 ops/s
>>>> SocketReadPosix.testRead8bOffset thrpt 5 2696627.066 ±
>>>> 297527.130 ops/s
>>>>
>>>> I have to point out that this benchmark is not perfect, as it actually
>>>> reads data from a backing server, so other sources of noise can come
>>>> into play.
>>>>
>>>> BR,
>>>>
>>>> Rado
>>>>
>>>>> Hi Maurizio,
>>>>>
>>>>>
>>>>> Checked against JExtract branch
>>>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>>
>>>>> Project:
>>>>> https://github.com/rsmogura/panama-io
>>>>>
>>>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>>>>
>>>>> The numbers look amazing - I have to check if it still does what it's
>>>>> intended to do (so I'll write some integration tests).
>>>>>
>>>>> Kind regards,
>>>>>
>>>>> Rado
>>>>>
>>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>>> Hi Ty,
>>>>>> there is a simple trick to make sure you get the best performance.
>>>>>>
>>>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it;
>>>>>> the returned VarHandle will throw an error at runtime instead of trying
>>>>>> to convert arguments.
>>>>>>
>>>>>> Rémi
>>>>>>
>>>>>> [1]
>>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
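>>>>>>
>>>>>> A minimal sketch of that (illustrative only, assuming the same
>>>>>> VarHandle/memory access API used elsewhere in this thread):
>>>>>>
>>>>>> VarHandle vh = MemoryHandles.varHandle(ValueLayout.JAVA_INT)
>>>>>>                              .withInvokeExactBehavior();
>>>>>> vh.set(segment, 0L, 5);  // ok: coordinate types match exactly
>>>>>> vh.set(segment, 0, 5);   // throws WrongMethodTypeException (int offset, long expected)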
>>>>>>
>>>>>> ----- Original Message -----
>>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>>>>> Yeah, I forgot that. Apologies.
>>>>>>>
>>>>>>>
>>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>>> Hi,
>>>>>>>> I don't think the 1ns difference is real - if you look at the error
>>>>>>>> in the second run, it is higher than that, so the difference is in
>>>>>>>> the noise.
>>>>>>>>
>>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>>>>> the VM can help with is removing bounds checks when you keep accessing
>>>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>>>> optimization called "bounds check elimination", or BCE (see the sketch
>>>>>>>> below). This optimization is routinely applied to Java array accesses,
>>>>>>>> but it used to fail for memory segments because the bound of a memory
>>>>>>>> segment is stored in a long variable, not an int.
>>>>>>>>
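>>>>>>>> A minimal sketch of the loop shape this helps with (illustrative
>>>>>>>> only; the handle and segment names here are made up, using the same
>>>>>>>> memory access API as the benchmark quoted below):
>>>>>>>>
>>>>>>>> VarHandle INT_HANDLE = MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>> long sum = 0;
>>>>>>>> for (long offset = 0; offset < segment.byteSize(); offset += 4) {
>>>>>>>>     // the long-typed bound used to defeat BCE; with the VM fix the
>>>>>>>>     // per-access bounds check can be hoisted out of the loop
>>>>>>>>     sum += (int) INT_HANDLE.get(segment, offset);
>>>>>>>> }
>>>>>>>>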
>>>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>>>> handle (e.g. you are passing an int offset instead of a long one; try
>>>>>>>> using "0L" instead of "0").
>>>>>>>>
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>>> VarHandles (AKA insertCoordinates) seems to consistently get about
>>>>>>>>> 1ns faster, so I guess these changes helped a bit?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Before:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Benchmark                                     Mode  Cnt   Score   Error  Units
>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark     avgt    5  21.155 ± 0.145  ns/op
>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark   avgt    5   0.678 ± 0.201  ns/op
>>>>>>>>> VarHandleBenchmark.specHandleBenchmark        avgt    5  17.323 ± 1.324  ns/op
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> After:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Benchmark                                     Mode  Cnt   Score   Error  Units
>>>>>>>>> VarHandleBenchmark.genericHandleBenchmark     avgt    5  20.304 ± 1.466  ns/op
>>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark   avgt    5   0.652 ± 0.156  ns/op
>>>>>>>>> VarHandleBenchmark.specHandleBenchmark        avgt    5  17.266 ± 1.712  ns/op
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Benchmark:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>>>         MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>>>                 ResourceScope.newSharedScope());
>>>>>>>>>
>>>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>>>         MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>>
>>>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>
>>>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>>
>>>>>>>>> @Benchmark
>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>> public void genericHandleBenchmark()
>>>>>>>>> {
>>>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> @Benchmark
>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>> public void specHandleBenchmark()
>>>>>>>>> {
>>>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> @Benchmark
>>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>>> public void specFinalHandleBenchmark()
>>>>>>>>> {
>>>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Sort of off-topic, but... I don't remember anyone previously saying
>>>>>>>>> that insertCoordinates would give that big of a difference (or any
>>>>>>>>> at all!), so it's surprising to me. I was expecting a performance
>>>>>>>>> decrease due to the handle no longer being static final. Could javac
>>>>>>>>> maybe optimize this, so that in any case where:
>>>>>>>>>
>>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>>
>>>>>>>>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created
>>>>>>>>> at compile time and inserted there instead?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>>>> apologize if
>>>>>>>>>> this results in duplicates).
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> a few days ago some VM enhancements were integrated [1, 2], so it
>>>>>>>>>> is time to take another look at where we are.
>>>>>>>>>>
>>>>>>>>>> I put together a branch which removes all workarounds (both
>>>>>>>>>> for long
>>>>>>>>>> loops and for alignment checks):
>>>>>>>>>>
>>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I also ran memory access benchmarks before/after, to see what
>>>>>>>>>> the
>>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>>
>>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>>>>> keeps up with mainline in basically all cases but one
>>>>>>>>>> (UnrolledAccess - this code pattern needs more work in the VM, but
>>>>>>>>>> Roland Westrelin has identified a possible fix for it). In some
>>>>>>>>>> cases (the parallel tests) we see quite a big jump forward.
>>>>>>>>>>
>>>>>>>>>> I think it's hard to say how these results will translate to the
>>>>>>>>>> real world - my gut feeling is that the simpler bounds-checking
>>>>>>>>>> logic will almost invariably result in performance improvements
>>>>>>>>>> with more complex code patterns, despite what synthetic benchmarks
>>>>>>>>>> might say (the current logic in mainline is fragile, as it has to
>>>>>>>>>> guard against integer overflow, which in turn sometimes kills BCE
>>>>>>>>>> optimizations; see the sketch below).
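>>>>>>>>>>
>>>>>>>>>> A generic illustration of that fragility (just the shape of the
>>>>>>>>>> problem, not the actual JDK code):
>>>>>>>>>>
>>>>>>>>>> // naive check: offset + length can overflow a long
>>>>>>>>>> if (offset + length > size) throw new IndexOutOfBoundsException();
>>>>>>>>>>
>>>>>>>>>> // overflow-safe check: correct, but the extra comparisons make it
>>>>>>>>>> // harder for C2 to prove accesses in-bounds, which can defeat BCE
>>>>>>>>>> if (offset < 0 || length < 0 || length > size - offset)
>>>>>>>>>>     throw new IndexOutOfBoundsException();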
>>>>>>>>>>
>>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>>
>>>>>>>>>> If you have a project that works against the Java 18 API, it would
>>>>>>>>>> be very helpful for us if you could try it on the above branch and
>>>>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Maurizio
>>>>>>>>>>
>>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>>
>>>>>>>>>>