status of VM long loop optimizations - call for action
Rado Smogura
mail at smogura.eu
Mon Dec 13 08:57:01 UTC 2021
Definitely yes!
I'll check the ASM output and graphs later, to see if there's anything
that looks strange.
I'd really like to see numbers with heap array pinning!
BR,
Rado
P.S. I think I need to finish packaging it and put it into a public repo -
it's a drop-in replacement that cooperates with the current JDK Socket factories.
On 11.12.2021 23:30, Maurizio Cimadamore wrote:
> Thanks Rado,
> seems like we're in the same ballpark? (which is great, since we're
> removing a lot of complexity from the implementation)
>
> (P.S. it's impressive how much faster your implementation is compared
> to JDK sockets, in the 2nd and 3rd benchmarks).
>
> Maurizio
>
> On 11/12/2021 16:38, Rado Smogura wrote:
>> Hi all,
>>
>>
>> Just for comparison, a run against the April commits
>>
>>
>> "Before"
>>
>> Benchmark                          Mode  Cnt        Score        Error  Units
>> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
>> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
>> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
>> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
>> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
>> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>>
>>
>> "Current" - other runs
>> Benchmark                          Mode  Cnt        Score        Error  Units
>> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ± 263855.161  ops/s
>> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±  68389.408  ops/s
>> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ± 297527.130  ops/s
>>
>> I have to point out that this benchmark is not perfect, as it actually
>> reads data from a backend server, so other noise may come into play.
>>
>> BR,
>>
>> Rado
>>
>>> Hi Maurizio,
>>>
>>>
>>> Checked against JExtract branch
>>> 2617fbfa3050913d34906f87027b8be8f10e53a9
>>>
>>> Project:
>>> https://github.com/rsmogura/panama-io
>>>
>>> Benchmark                          Mode  Cnt        Score        Error  Units
>>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>>
>>> Numbers look amazing - I have to check that it still does what it's
>>> intended to do (i.e. write some integration tests).
>>>
>>> Kind regards,
>>>
>>> Rado
>>>
>>> On 10.12.2021 23:33, Remi Forax wrote:
>>>> Hi Ty,
>>>> there is a simple trick to make sure you get the best performance.
>>>>
>>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it;
>>>> the returned VarHandle will throw an error at runtime instead of
>>>> trying to convert arguments.
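>>>>
>>>> As a minimal sketch (placeholder names, assuming the jdk.incubator.foreign
>>>> API used elsewhere in this thread), it would look roughly like this:
>>>>
>>>> import java.lang.invoke.VarHandle;
>>>> import jdk.incubator.foreign.MemoryHandles;
>>>> import jdk.incubator.foreign.MemorySegment;
>>>> import jdk.incubator.foreign.ResourceScope;
>>>> import jdk.incubator.foreign.ValueLayout;
>>>>
>>>> class ExactHandleSketch {
>>>>     // withInvokeExactBehavior() makes the handle reject inexact argument
>>>>     // lists (e.g. an int offset) instead of silently converting them.
>>>>     static final VarHandle INT_HANDLE =
>>>>             MemoryHandles.varHandle(ValueLayout.JAVA_INT).withInvokeExactBehavior();
>>>>
>>>>     public static void main(String[] args) {
>>>>         MemorySegment segment = MemorySegment.allocateNative(
>>>>                 ValueLayout.JAVA_INT, ResourceScope.newConfinedScope());
>>>>         INT_HANDLE.set(segment, 0L, 5);    // exact: the offset is a long
>>>>         // INT_HANDLE.set(segment, 0, 5);  // would throw WrongMethodTypeException
>>>>     }
>>>> }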
>>>>
>>>> Rémi
>>>>
>>>> [1]
>>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>>>>
>>>> ----- Original Message -----
>>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>>> Yeah, I forgot that. Apologies.
>>>>>
>>>>>
>>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>>> Hi,
>>>>>> I don't think the 1ns difference is real - if you look at the
>>>>>> error in the second run, it's higher than that, so it's in the noise.
>>>>>>
>>>>>> And, since there's no loop, I don't think this specific kind of
>>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>>> the VM can help with is removing bounds checks when you keep accessing
>>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>>> optimization called "bounds check elimination", or BCE. This
>>>>>> optimization is routinely applied to Java array accesses, but it
>>>>>> used to fail for memory segments because the bound of a memory
>>>>>> segment is stored in a long variable, not an int.
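>>>>>>
>>>>>> As a hypothetical illustration of the kind of loop this helps (the
>>>>>> method below is a placeholder, not code from this thread): a segment
>>>>>> accessed with a long induction variable, where the per-element bounds
>>>>>> check can now be hoisted out of the loop:
>>>>>>
>>>>>> // Sums all ints in a segment; byteSize() is a long, so the loop index
>>>>>> // is a long too - the case that used to defeat BCE.
>>>>>> static long sumInts(MemorySegment segment) {
>>>>>>     long sum = 0;
>>>>>>     for (long offset = 0; offset < segment.byteSize(); offset += 4) {
>>>>>>         sum += segment.get(ValueLayout.JAVA_INT, offset);
>>>>>>     }
>>>>>>     return sum;
>>>>>> }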
>>>>>>
>>>>>> That said, note that you are passing inexact arguments to the var
>>>>>> handle (e.g. you are passing an int offset instead of a long one;
>>>>>> try
>>>>>> to use "0L" instead of "0").
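>>>>>>
>>>>>> For the benchmark below, that would mean something along the lines of:
>>>>>>
>>>>>> // hypothetical fix: pass the offset as a long so the invocation is exact
>>>>>> GENERIC_HANDLE.set(SEGMENT, 0L, 5);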
>>>>>>
>>>>>> Maurizio
>>>>>>
>>>>>>
>>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>>> A simple write benchmark I had already made for specialized
>>>>>>> VarHandles(AKA insertCoordinates) seems to get about 1ns
>>>>>>> consistently
>>>>>>> faster, so I guess these changes helped a bit?
>>>>>>>
>>>>>>>
>>>>>>> Before:
>>>>>>>
>>>>>>>
>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>>
>>>>>>>
>>>>>>> After:
>>>>>>>
>>>>>>>
>>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>>
>>>>>>>
>>>>>>> Benchmark:
>>>>>>>
>>>>>>>
>>>>>>> public static final MemorySegment SEGMENT =
>>>>>>>         MemorySegment.allocateNative(ValueLayout.JAVA_INT, ResourceScope.newSharedScope());
>>>>>>>
>>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>>         MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>>
>>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>
>>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>> public void genericHandleBenchmark()
>>>>>>> {
>>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>> public void specHandleBenchmark()
>>>>>>> {
>>>>>>>     SPEC_HANDLE.set(5);
>>>>>>> }
>>>>>>>
>>>>>>> @Benchmark
>>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>>> public void specFinalHandleBenchmark()
>>>>>>> {
>>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> Sort of off-topic but... I don't remember anyone saying previously
>>>>>>> that insertCoordinates would give that big of a difference (or any at
>>>>>>> all!), so it's surprising to me. I was expecting a performance
>>>>>>> decrease due to the handle no longer being static-final. Could javac
>>>>>>> maybe optimize this, so that for any call like:
>>>>>>>
>>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>>
>>>>>>> an optimized VarHandle equivalent to SPEC_HANDLE is created at
>>>>>>> compile time and inserted there instead?
>>>>>>>
>>>>>>>
>>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>>> (resending since mailing lists were down yesterday - I
>>>>>>>> apologize if
>>>>>>>> this results in duplicates).
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> A few days ago some VM enhancements were integrated [1, 2], so it is
>>>>>>>> time to take another look at where we are.
>>>>>>>>
>>>>>>>> I put together a branch which removes all workarounds (both for
>>>>>>>> long
>>>>>>>> loops and for alignment checks):
>>>>>>>>
>>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>>
>>>>>>>>
>>>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>>>> difference is like - here's a visual report:
>>>>>>>>
>>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>>>>>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>>>>>>> has identified a possible fix for it). In some cases (parallel
>>>>>>>> tests) we see quite a big jump forward.
>>>>>>>>
>>>>>>>> I think it's hard to say how these results will translate to the
>>>>>>>> real world - my gut feeling is that the simpler bounds checking
>>>>>>>> logic will almost invariably result in performance improvements with
>>>>>>>> more complex code patterns, despite what synthetic benchmarks might
>>>>>>>> say (the current logic in mainline is fragile, as it has to guard
>>>>>>>> against integer overflow, which in turn sometimes kills BCE
>>>>>>>> optimizations).
>>>>>>>>
>>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>>
>>>>>>>> If you have a project that works against the Java 18 API, it
>>>>>>>> would be very helpful for us if you could try it on the above branch
>>>>>>>> and report back. This will help us make a more informed decision.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Maurizio
>>>>>>>>
>>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>>
>>>>>>>>