status of VM long loop optimizations - call for action
Rado Smogura
mail at smogura.eu
Mon Dec 13 17:50:00 UTC 2021
Hi,
I checked on my side: the code there is actually more complicated,
and there is no direct loop unrolling.
Kind regards,
Rado
On 13.12.2021 09:57, Rado Smogura wrote:
> Definitely yes!
>
>
> I'll check the ASM output and graphs later, to see whether anything
> looks strange.
>
>
> I would really like to see numbers with heap array pinning!
>
>
> BR,
>
> Rado
>
>
> P.S. I think I need to finish packaging it and put it into a public repo
> - it's a drop-in replacement that cooperates with the current JDK Socket
> factories.
>
> On 11.12.2021 23:30, Maurizio Cimadamore wrote:
>> Thanks Rado,
>> seems like we're in the same ballpark? (which is great, since we're
>> removing a lot of complexity from the implementation)
>>
>> (P.S. it's impressive how much faster your implementation is compared
>> to JDK sockets, in the 2nd and 3rd bench).
>>
>> Maurizio
>>
>> On 11/12/2021 16:38, Rado Smogura wrote:
>>> Hi all,
>>>
>>>
>>> Just for comparison, I ran it against the April commits
>>>
>>>
>>> "Before"
>>>
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
>>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
>>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
>>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
>>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
>>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>>>
>>>
>>> "Current" - other runs
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ± 263855.161  ops/s
>>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±  68389.408  ops/s
>>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ± 297527.130  ops/s
>>>
>>> I have to point out that this benchmark is not perfect, as it really
>>> reads data from the back server, so other noise can apply.
>>>
>>> BR,
>>>
>>> Rado
>>>
>>>> Hi Maurizio,
>>>>
>>>>
>>>> Checked against JExtract branch
>>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>>
>>>> Project:
>>>> https://github.com/rsmogura/panama-io
>>>>
>>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>>>
>>>> Numbers look amazing - I have to check if it still does what it's
>>>> intended to do (so I need to write some integration tests).
>>>>
>>>> Kind regards,
>>>>
>>>> Rado
>>>>
>>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>>> Hi Ty,
>>>>> there is a simple trick to be sure to get the best performance.
>>>>>
>>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on
>>>>> it,
>>>>> the returned VarHandle will throw an error at runtime instead of
>>>>> trying to convert arguments.
>>>>>
>>>>> Rémi
>>>>>
>>>>> [1]
>>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>>>>>
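A minimal sketch of the trick above. It assumes a plain array-element
VarHandle from java.lang.invoke (Java 16 or later) rather than a memory
segment handle, so it stays self-contained; the class and method names are
hypothetical. The idea is the same: the exact handle refuses an int where a
long is expected, instead of silently converting it.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

// Hypothetical demo class; not part of the benchmark discussed in this thread.
final class ExactVarHandleDemo {
    // Coordinates are (long[], int) and the value type is long.
    static final VarHandle LENIENT = MethodHandles.arrayElementVarHandle(long[].class);
    // Same handle, but calls must match the access mode type exactly.
    static final VarHandle EXACT = LENIENT.withInvokeExactBehavior();

    // The default handle adapts the int argument 5 to a long.
    static long lenientAcceptsIntValue() {
        long[] data = new long[1];
        LENIENT.set(data, 0, 5);
        return data[0];
    }

    // The exact handle rejects the int argument at runtime.
    static boolean exactRejectsIntValue() {
        long[] data = new long[1];
        try {
            EXACT.set(data, 0, 5);
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }

    // With a long literal the call site matches exactly and succeeds.
    static long exactAcceptsLongValue() {
        long[] data = new long[1];
        EXACT.set(data, 0, 5L);
        return data[0];
    }

    public static void main(String[] args) {
        System.out.println("lenient stored: " + lenientAcceptsIntValue());
        System.out.println("exact rejected int: " + exactRejectsIntValue());
        System.out.println("exact stored: " + exactAcceptsLongValue());
    }
}
```

Using the exact handle during development turns an accidental conversion
into an immediate failure, which makes inexact call sites easy to find.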
>>>>> ----- Original Message -----
>>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>>> "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>>>> Yeah, I forgot that. Apologies.
>>>>>>
>>>>>>
>>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>>> Hi,
>>>>>>> I don't think the 1ns difference is real - the error in the
>>>>>>> second run is higher than that, so it's in the noise.
>>>>>>>
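The "in the noise" argument can be checked with a little arithmetic on the
figures quoted in this thread. The sketch below treats each Score ± Error
pair as an interval and tests whether the before and after intervals
overlap. This is a rough heuristic, not a formal significance test, and the
class and method names are hypothetical.

```java
// Hypothetical helper; the input numbers are copied from the tables in this thread.
final class JmhIntervalCheck {
    // True if [scoreA - errA, scoreA + errA] overlaps [scoreB - errB, scoreB + errB].
    static boolean intervalsOverlap(double scoreA, double errA,
                                    double scoreB, double errB) {
        double lowA = scoreA - errA, highA = scoreA + errA;
        double lowB = scoreB - errB, highB = scoreB + errB;
        return lowA <= highB && lowB <= highA;
    }

    public static void main(String[] args) {
        // genericHandleBenchmark: before 21.155 ± 0.145, after 20.304 ± 1.466
        System.out.println(intervalsOverlap(21.155, 0.145, 20.304, 1.466));
        // specFinalHandleBenchmark: before 0.678 ± 0.201, after 0.652 ± 0.156
        System.out.println(intervalsOverlap(0.678, 0.201, 0.652, 0.156));
        // specHandleBenchmark: before 17.323 ± 1.324, after 17.266 ± 1.712
        System.out.println(intervalsOverlap(17.323, 1.324, 17.266, 1.712));
    }
}
```

For genericHandleBenchmark the intervals are [21.010, 21.300] and
[18.838, 21.770], which overlap, so the roughly 0.85 ns gap is smaller than
the reported error of the second run.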
>>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>>> benchmark should be affected in any way by the VM improvements.
>>>>>>> What the VM can help with is to remove bound checks when you keep
>>>>>>> accessing a segment in a loop, as C2 is now able to correctly
>>>>>>> apply an optimization called "bound check elimination" or BCE.
>>>>>>> This optimization is routinely applied on Java array access, but
>>>>>>> it used to fail for memory segments because the bound of a memory
>>>>>>> segment is stored in a long variable, not an int.
>>>>>>>
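The code shape described above can be sketched as follows. This is not the
memory segment implementation: LongBounded is a hypothetical stand-in that
wraps a long[] and exposes its size as a long, so that the loop variable and
the per-access bound check are both long, as they are for segments. Whether
the JIT actually removes the check depends on the JVM build.

```java
import java.util.Objects;

// Hypothetical stand-in for a segment-like object with a long bound.
final class LongBoundedLoopDemo {
    static final class LongBounded {
        private final long[] data;
        private final long size; // the bound is a long, not an int

        LongBounded(long[] data) {
            this.data = data;
            this.size = data.length;
        }

        long size() {
            return size;
        }

        long get(long index) {
            // Per-access bound check against a long limit (Java 16+ overload).
            Objects.checkIndex(index, size);
            return data[(int) index];
        }
    }

    // A long-indexed loop: every get() repeats the bound check unless the
    // JIT can prove it redundant and hoist or remove it.
    static long sum(LongBounded bounded) {
        long total = 0;
        for (long i = 0; i < bounded.size(); i++) {
            total += bounded.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(new LongBounded(new long[] {1, 2, 3, 4})));
    }
}
```

A benchmark with a single access and no loop, like the one quoted in this
thread, gives the JIT nothing to hoist, which is why it is not expected to
change.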
>>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>>> handle (e.g. you are passing an int offset instead of a long one;
>>>>>>> try to use "0L" instead of "0").
>>>>>>>
>>>>>>> Maurizio
>>>>>>>
>>>>>>>
>>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>>> VarHandles (AKA insertCoordinates) seems to be about 1ns faster
>>>>>>>> consistently, so I guess these changes helped a bit?
>>>>>>>>
>>>>>>>>
>>>>>>>> Before:
>>>>>>>>
>>>>>>>>
>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>>>
>>>>>>>>
>>>>>>>> After:
>>>>>>>>
>>>>>>>>
>>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>>>
>>>>>>>>
>>>>>>>> Benchmark:
>>>>>>>>
>>>>>>>>
>>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>>     MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>>>         ResourceScope.newSharedScope());
>>>>>>>>
>>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>>     MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>>
>>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>
>>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>>
>>>>>>>> @Benchmark
>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>> public void genericHandleBenchmark()
>>>>>>>> {
>>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @Benchmark
>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>> public void specHandleBenchmark()
>>>>>>>> {
>>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @Benchmark
>>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>>> public void specFinalHandleBenchmark()
>>>>>>>> {
>>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Sort of off-topic, but I don't remember anyone saying previously
>>>>>>>> that insertCoordinates would make that big of a difference (or
>>>>>>>> any at all!), so it's surprising to me. I was expecting a
>>>>>>>> performance decrease due to the handle no longer being
>>>>>>>> static final. Can javac maybe optimize this so that in any case
>>>>>>>> where:
>>>>>>>>
>>>>>>>>
>>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>>
>>>>>>>>
>>>>>>>> is used, an optimized VarHandle equivalent to SPEC_HANDLE is
>>>>>>>> created at compile time and inserted there instead?
>>>>>>>>
>>>>>>>>
>>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>>> apologize if
>>>>>>>>> this results in duplicates).
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> a few days ago some VM enhancements were integrated [1, 2], so
>>>>>>>>> it is time to take another look at where we are.
>>>>>>>>>
>>>>>>>>> I put together a branch which removes all workarounds (both for
>>>>>>>>> long loops and for alignment checks):
>>>>>>>>>
>>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>>>>> difference is like - here's a visual report:
>>>>>>>>>
>>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Overall, I think the numbers are solid. The branch w/o
>>>>>>>>> workarounds keeps up with mainline in basically all cases but
>>>>>>>>> one (UnrolledAccess - this code pattern needs more work in the
>>>>>>>>> VM, but Roland Westrelin has identified a possible fix for it).
>>>>>>>>> In some cases (parallel tests) we see quite a big jump forward.
>>>>>>>>>
>>>>>>>>> I think it's hard to say how these results will translate to
>>>>>>>>> the real world - my gut feeling is that the simpler bound
>>>>>>>>> checking logic will almost invariably result in performance
>>>>>>>>> improvements with more complex code patterns, despite what
>>>>>>>>> synthetic benchmarks might say (the current logic in mainline is
>>>>>>>>> fragile as it has to guard against integer overflow, which in
>>>>>>>>> turn sometimes kills BCE optimizations).
>>>>>>>>>
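As a generic illustration of why bound checks have to guard against integer
overflow, here is a small sketch. It is not the JDK's actual checking logic;
the class and method names are hypothetical. A naive "offset + length <=
size" test wraps around for very large offsets and wrongly accepts the
access, while a rearranged comparison avoids the addition altogether.

```java
// Hypothetical demo; assumes size is non-negative.
final class OverflowSafeBoundsDemo {
    // Naive check: offset + length can overflow and become negative,
    // which makes the comparison pass for an out-of-range access.
    static boolean naiveInBounds(long offset, long length, long size) {
        return offset >= 0 && length >= 0 && offset + length <= size;
    }

    // Overflow-safe check: once 0 <= offset <= size is known,
    // size - offset cannot overflow.
    static boolean safeInBounds(long offset, long length, long size) {
        return offset >= 0 && length >= 0
                && offset <= size && length <= size - offset;
    }

    public static void main(String[] args) {
        long hugeOffset = Long.MAX_VALUE - 1;
        System.out.println("naive: " + naiveInBounds(hugeOffset, 4, 16));
        System.out.println("safe:  " + safeInBounds(hugeOffset, 4, 16));
    }
}
```

The extra comparisons in the safe form are the kind of additional logic
that can make a check harder for a JIT compiler to recognize and eliminate.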
>>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>>
>>>>>>>>> If you have a project that works against the Java 18 API, it
>>>>>>>>> would be very helpful for us if you could try it on the above
>>>>>>>>> branch and report back. This will help us make a more informed
>>>>>>>>> decision.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Maurizio
>>>>>>>>>
>>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>>
>>>>>>>>>