status of VM long loop optimizations - call for action
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Sat Dec 11 22:30:01 UTC 2021
Thanks Rado,
seems like we're in the same ballpark? (which is great, since we're
removing a lot of complexity from the implementation)
(P.S. it's impressive how much faster your implementation is compared to
JDK sockets, in the 2nd and 3rd bench).
Maurizio
On 11/12/2021 16:38, Rado Smogura wrote:
> Hi all,
>
>
> Just for comparison, run against April commits
>
>
> "Before"
>
> Benchmark                         Mode  Cnt        Score         Error  Units
> SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
> SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
> SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
> SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
> SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
> SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
>
>
> "Current" - other runs
> Benchmark                         Mode  Cnt        Score         Error  Units
> SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ± 263855.161  ops/s
> SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±  68389.408  ops/s
> SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ± 297527.130  ops/s
>
> I have to point out that this benchmark is not perfect, as it really
> reads data from the back server, so other noise can apply.
>
> BR,
>
> Rado
>
>> Hi Maurizio,
>>
>>
>> Checked against JExtract branch 2617fbfa3050913d34906f87027b8be8f10e53a9
>>
>> Project:
>> https://github.com/rsmogura/panama-io
>>
>> Benchmark                         Mode  Cnt        Score         Error  Units
>> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
>> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
>> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
>> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
>> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
>> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>>
>> Numbers look amazing - I have to check if it still does what it's
>> intended to do (so write some integration test).
>>
>> Kind regards,
>>
>> Rado
>>
>> On 10.12.2021 23:33, Remi Forax wrote:
>>> Hi Ty,
>>> there is a simple trick to be sure to get the best performance.
>>>
>>> When you create the VarHandle, call withInvokeExactBehavior [1] on it,
>>> the returned VarHandle will throw an error at runtime instead of
>>> trying to convert arguments.
>>>
>>> Rémi
>>>
>>> [1]
>>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>>>
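A minimal sketch of the exact-invocation trick described above, using only java.base so it does not depend on the incubating foreign-memory API. The class name `ExactHandleDemo` and the use of a `long[]` element handle (instead of a memory segment handle) are assumptions made for illustration; the point is only that an `int` argument where a `long` is expected is silently converted by a regular handle, but rejected by one obtained from `withInvokeExactBehavior()`.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

final class ExactHandleDemo {
    // Element handle for long[]: coordinates are (long[] array, int index), value type is long.
    static final VarHandle LENIENT = MethodHandles.arrayElementVarHandle(long[].class);

    // Same handle, but call sites must match the access mode type exactly (Java 16+).
    static final VarHandle EXACT = LENIENT.withInvokeExactBehavior();

    // The int literal 5 is converted to long behind the scenes.
    static boolean lenientAcceptsIntValue() {
        long[] cell = new long[1];
        LENIENT.set(cell, 0, 5);
        return cell[0] == 5L;
    }

    // The same call on the exact handle fails at runtime instead of converting.
    static boolean exactRejectsIntValue() {
        long[] cell = new long[1];
        try {
            EXACT.set(cell, 0, 5);
            return false;
        } catch (WrongMethodTypeException expected) {
            return true;
        }
    }

    // Passing a long literal matches the handle's type exactly.
    static boolean exactAcceptsLongValue() {
        long[] cell = new long[1];
        EXACT.set(cell, 0, 5L);
        return cell[0] == 5L;
    }

    public static void main(String[] args) {
        System.out.println("lenient accepts int value: " + lenientAcceptsIntValue());
        System.out.println("exact rejects int value:   " + exactRejectsIntValue());
        System.out.println("exact accepts long value:  " + exactAcceptsLongValue());
    }
}
```

Using the exact handle during development turns an accidental conversion into an immediate error, which makes inexact call sites easy to find.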
>>> ----- Original Message -----
>>>> From: "Ty Young" <youngty1997 at gmail.com>
>>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>>> Subject: Re: status of VM long loop optimizations - call for action
>>>> Yeah, I forgot that. Apologies.
>>>>
>>>>
>>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>>> Hi,
>>>>> I don't think the 1ns difference is real - if you look, the error
>>>>> in the second run is higher than that, so it's in the noise.
>>>>>
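The "in the noise" argument can be checked with simple arithmetic: if the gap between two scores is no larger than the sum of their reported errors, the two intervals overlap and the runs cannot be told apart. The sketch below is only an illustration; the class and method names are hypothetical, and the numbers are copied from the before/after VarHandleBenchmark results quoted further down in this thread.

```java
final class NoiseCheck {
    /** True when [a - aErr, a + aErr] and [b - bErr, b + bErr] share at least one point. */
    static boolean intervalsOverlap(double a, double aErr, double b, double bErr) {
        return Math.abs(a - b) <= aErr + bErr;
    }

    public static void main(String[] args) {
        // Scores and errors in ns/op, "before" run first, "after" run second.
        System.out.println("generic:   " + intervalsOverlap(21.155, 0.145, 20.304, 1.466));
        System.out.println("specFinal: " + intervalsOverlap(0.678, 0.201, 0.652, 0.156));
        System.out.println("spec:      " + intervalsOverlap(17.323, 1.324, 17.266, 1.712));
    }
}
```

All three comparisons overlap, which is consistent with reading the before/after difference as noise rather than a real speedup.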
>>>>> And, since there's no loop, I don't think this specific kind of
>>>>> benchmark should be affected in any way by the VM improvements. What
>>>>> the VM can help with is to remove bound checks when you keep
>>>>> accessing
>>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>>> optimization called "bound check elimination" or BCE. This
>>>>> optimization is routinely applied on Java array access, but it
>>>>> used to
>>>>> fail for memory segments because the bound of a memory segment is
>>>>> stored in a long variable, not an int.
>>>>>
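To make the two loop shapes concrete, here is a rough sketch using only java.base. `LongBoundedBuffer` is a hypothetical stand-in for a memory segment, not the real API: it keeps its size in a `long` and checks every access against it, which is the pattern described above. The code shows only the shape of the loops; whether the JIT actually removes the checks depends on the VM build and cannot be observed from the result.

```java
import java.util.Objects;

final class LongBoundedBuffer {
    private final byte[] data;
    private final long byteSize; // the bound is a long, as with a memory segment's size

    LongBoundedBuffer(int size) {
        data = new byte[size];
        for (int i = 0; i < size; i++) {
            data[i] = (byte) (i % 7); // arbitrary test pattern
        }
        byteSize = size;
    }

    // Every access is checked against the long bound (Objects.checkIndex(long, long), Java 16+).
    byte get(long offset) {
        Objects.checkIndex(offset, byteSize);
        return data[(int) offset];
    }

    // Long loop variable and long bound: the case that used to defeat check elimination.
    long sumLongIndexed() {
        long sum = 0;
        for (long i = 0; i < byteSize; i++) {
            sum += get(i);
        }
        return sum;
    }

    // Plain int-indexed array loop: the case where the optimization is routinely applied.
    long sumIntIndexed() {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }
}
```

Both methods compute the same sum; the difference lies only in how much bounds-checking work the compiler can hoist out of the loop.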
>>>>> That said, note that you are passing inexact arguments to the var
>>>>> handle (e.g. you are passing an int offset instead of a long one; try
>>>>> to use "0L" instead of "0").
>>>>>
>>>>> Maurizio
>>>>>
>>>>>
>>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>>> A simple write benchmark I had already made for specialized
>>>>>> VarHandles (AKA insertCoordinates) seems to be consistently about
>>>>>> 1ns faster, so I guess these changes helped a bit?
>>>>>>
>>>>>>
>>>>>> Before:
>>>>>>
>>>>>>
>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>>
>>>>>>
>>>>>> After:
>>>>>>
>>>>>>
>>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>>
>>>>>>
>>>>>> Benchmark:
>>>>>>
>>>>>>
>>>>>> public static final MemorySegment SEGMENT =
>>>>>>     MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>>         ResourceScope.newSharedScope());
>>>>>>
>>>>>> public static final VarHandle GENERIC_HANDLE =
>>>>>>     MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>>
>>>>>> public static VarHandle SPEC_HANDLE =
>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>
>>>>>> public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>>
>>>>>> @Benchmark
>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>> public void genericHandleBenchmark()
>>>>>> {
>>>>>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>> }
>>>>>>
>>>>>> @Benchmark
>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>> public void specHandleBenchmark()
>>>>>> {
>>>>>>     SPEC_HANDLE.set(5);
>>>>>> }
>>>>>>
>>>>>> @Benchmark
>>>>>> @BenchmarkMode(Mode.AverageTime)
>>>>>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>> public void specFinalHandleBenchmark()
>>>>>> {
>>>>>>     SPEC_HANDLE_FINAL.set(5);
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Sort of off-topic but... I don't remember anyone saying previously
>>>>>> that insertCoordinates would make that big of a difference (or any
>>>>>> at all!), so it's surprising to me. I was expecting a performance
>>>>>> decrease due to the handle no longer being static-final. Can javac
>>>>>> maybe optimize this so that, wherever code like:
>>>>>>
>>>>>>
>>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>>
>>>>>>
>>>>>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created
>>>>>> at compile time and inserted there instead?
>>>>>>
>>>>>>
>>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>>> (resending since mailing lists were down yesterday - I apologize if
>>>>>>> this results in duplicates).
>>>>>>>
>>>>>>> Hi,
>>>>>>> a few days ago some VM enhancements were integrated [1, 2], so it is
>>>>>>> time to take a look again at where we are.
>>>>>>>
>>>>>>> I put together a branch which removes all workarounds (both for
>>>>>>> long
>>>>>>> loops and for alignment checks):
>>>>>>>
>>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>>
>>>>>>>
>>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>>> difference is like - here's a visual report:
>>>>>>>
>>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>>>> keeps up with mainline in basically all cases but one
>>>>>>> (UnrolledAccess - this code pattern needs more work in the VM, but
>>>>>>> Roland Westrelin has identified a possible fix for it). In some
>>>>>>> cases (parallel tests) we see quite a big jump forward.
>>>>>>>
>>>>>>> I think it's hard to say how these results will translate to the
>>>>>>> real world - my gut feeling is that the simpler bound checking
>>>>>>> logic will almost invariably result in performance improvements
>>>>>>> with more complex code patterns, despite what synthetic benchmarks
>>>>>>> might say (the current logic in mainline is fragile as it has to
>>>>>>> guard against integer overflow, which in turn sometimes kills BCE
>>>>>>> optimizations).
>>>>>>>
>>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>>
>>>>>>> If you have a project that works against the Java 18 API, it would
>>>>>>> be very helpful for us if you could try it on the above branch and
>>>>>>> report back. This will help us make a more informed decision.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Maurizio
>>>>>>>
>>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>>
>>>>>>>