status of VM long loop optimizations - call for action
Rado Smogura
mail at smogura.eu
Sat Dec 11 16:38:00 UTC 2021
Hi all,
Just for comparison, here is a run against the April commits:
"Before"
Benchmark                          Mode  Cnt        Score        Error  Units
SocketReadJdk.teatRead4k          thrpt    5   939997.688 ±  74877.602  ops/s
SocketReadJdk.testRead16b         thrpt    5  1881053.005 ±  72637.626  ops/s
SocketReadJdk.testRead8bOffset    thrpt    5  1924527.582 ±  38308.317  ops/s
SocketReadPosix.teatRead4k        thrpt    5  1157621.341 ± 106649.696  ops/s
SocketReadPosix.testRead16b       thrpt    5  3059826.951 ± 232852.053  ops/s
SocketReadPosix.testRead8bOffset  thrpt    5  2983402.371 ± 269646.104  ops/s
"Current" - other runs
Benchmark                          Mode  Cnt        Score        Error  Units
SocketReadPosix.teatRead4k        thrpt    5  1163288.078 ± 263855.161  ops/s
SocketReadPosix.testRead16b       thrpt    5  3118810.213 ±  68389.408  ops/s
SocketReadPosix.testRead8bOffset  thrpt    5  2696627.066 ± 297527.130  ops/s
I should point out that this benchmark is not perfect: it actually
reads data from a backend server, so other sources of noise can apply.
BR,
Rado
> Hi Maurizio,
>
>
> Checked against the JExtract branch, commit 2617fbfa3050913d34906f87027b8be8f10e53a9
>
> Project: https://github.com/rsmogura/panama-io
>
> Benchmark                          Mode  Cnt        Score        Error  Units
> SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
> SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
> SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
> SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
> SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
> SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s
>
> The numbers look amazing. I still have to check that it does what it's
> intended to do (i.e. write some integration tests).
>
> Kind regards,
>
> Rado
>
> On 10.12.2021 23:33, Remi Forax wrote:
>> Hi Ty,
>> there is a simple trick to make sure you get the best performance.
>>
>> When you create the VarHandle, call withInvokeExactBehavior() [1] on it;
>> the returned VarHandle will throw an exception at runtime instead of
>> trying to convert arguments.
>>
>> Rémi
>>
>> [1]
>> https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
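A minimal sketch of the effect Rémi describes, assuming Java 16 or later (where `withInvokeExactBehavior()` was added). To run without the incubator module it uses a plain `long[]` element handle from `MethodHandles.arrayElementVarHandle` instead of a memory-segment handle; the class name is illustrative. The value type is `long`, so passing the literal `5` mirrors the `0` vs `0L` offset issue Maurizio points out below.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

public class ExactVarHandleDemo {
    // Element handle for long[]: coordinates (long[], int), value type long.
    static final VarHandle LENIENT = MethodHandles.arrayElementVarHandle(long[].class);

    // Same access, but every call site must match the access-mode type exactly.
    static final VarHandle EXACT = LENIENT.withInvokeExactBehavior();

    // Returns true if EXACT rejects an int value where a long is expected.
    static boolean exactRejectsIntValue(long[] array) {
        try {
            EXACT.set(array, 0, 5);   // symbolic type (long[], int, int)void: mismatch
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long[] array = new long[1];

        LENIENT.set(array, 0, 5);     // int 5 is silently widened to long
        System.out.println(array[0]); // prints 5

        EXACT.set(array, 0, 7L);      // exact match: (long[], int, long)void
        System.out.println(array[0]); // prints 7

        System.out.println(exactRejectsIntValue(array)); // prints true
    }
}
```

With the exact handle, a type mismatch fails on the first call instead of being adapted silently, which makes accidental conversions in a benchmark easy to spot.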
>>
>> ----- Original Message -----
>>> From: "Ty Young" <youngty1997 at gmail.com>
>>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>,
>>> "panama-dev at openjdk.java.net'" <panama-dev at openjdk.java.net>
>>> Sent: Friday, December 10, 2021 11:18:45 PM
>>> Subject: Re: status of VM long loop optimizations - call for action
>>> Yeah, I forgot that. Apologies.
>>>
>>>
>>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>>> Hi,
>>>> I don't think the 1ns difference is real: if you look at the error in
>>>> the second run, it is larger than that, so the difference is in the noise.
>>>>
>>>> And, since there's no loop, I don't think this specific kind of
>>>> benchmark should be affected in any way by the VM improvements. What
>>>> the VM can help with is to remove bound checks when you keep accessing
>>>> a segment in a loop, as C2 is now able to correctly apply an
>>>> optimization called "bound check elimination" or BCE. This
>>>> optimization is routinely applied to Java array accesses, but it used to
>>>> fail for memory segments because the bound of a memory segment is
>>>> stored in a long variable, not an int.
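To make the distinction concrete, here is a toy sketch (not taken from the thread or the JDK sources) of the two loop shapes. The first is an ordinary `int`-indexed array loop, the kind of access where, as noted above, C2 routinely applies BCE. The second mimics segment access: the logical size is a `long`, and each access is checked with `Objects.checkIndex(long, long)` (Java 16+) before reading. The `int[]` backing store and the method names are illustrative assumptions; a real memory segment performs its own `long`-based check internally, and nothing here measures whether a check is actually removed.

```java
import java.util.Objects;

public class BoundCheckShapes {
    // int-indexed counted loop over an array: the classic shape for
    // bound check elimination (BCE).
    static long sumArray(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Segment-like access: the index and the bound are both longs.
    // Per the thread, BCE used to fail for this kind of long-based check.
    static long sumLongIndexed(int[] backing, long size) {
        long sum = 0;
        for (long i = 0; i < size; i++) {
            Objects.checkIndex(i, size);   // explicit long range check
            sum += backing[(int) i];       // hypothetical stand-in for a segment read
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumArray(data));                    // prints 10
        System.out.println(sumLongIndexed(data, data.length)); // prints 10
    }
}
```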
>>>>
>>>> That said, note that you are passing inexact arguments to the var
>>>> handle (e.g. you are passing an int offset instead of a long one; try
>>>> to use "0L" instead of "0").
>>>>
>>>> Maurizio
>>>>
>>>>
>>>> On 10/12/2021 21:34, Ty Young wrote:
>>>>> A simple write benchmark I had already made for specialized
>>>>> VarHandles (AKA insertCoordinates) seems to be consistently about 1ns
>>>>> faster, so I guess these changes helped a bit?
>>>>>
>>>>>
>>>>> Before:
>>>>>
>>>>>
>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>>
>>>>>
>>>>> After:
>>>>>
>>>>>
>>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>>
>>>>>
>>>>> Benchmark:
>>>>>
>>>>>
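>>>>> import java.lang.invoke.VarHandle;
>>>>> import java.util.concurrent.TimeUnit;
>>>>>
>>>>> import jdk.incubator.foreign.MemoryHandles;
>>>>> import jdk.incubator.foreign.MemorySegment;
>>>>> import jdk.incubator.foreign.ResourceScope;
>>>>> import jdk.incubator.foreign.ValueLayout;
>>>>>
>>>>> import org.openjdk.jmh.annotations.Benchmark;
>>>>> import org.openjdk.jmh.annotations.BenchmarkMode;
>>>>> import org.openjdk.jmh.annotations.Mode;
>>>>> import org.openjdk.jmh.annotations.OutputTimeUnit;
>>>>>
>>>>> public class VarHandleBenchmark
>>>>> {
>>>>>     // A single native int, kept alive by a shared scope.
>>>>>     public static final MemorySegment SEGMENT =
>>>>>             MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>>                     ResourceScope.newSharedScope());
>>>>>
>>>>>     // Generic handle: coordinates are (MemorySegment, long offset).
>>>>>     public static final VarHandle GENERIC_HANDLE =
>>>>>             MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>>
>>>>>     // Segment and offset bound up front, so only the value is passed.
>>>>>     // Deliberately not final, for comparison with SPEC_HANDLE_FINAL.
>>>>>     public static VarHandle SPEC_HANDLE =
>>>>>             MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>
>>>>>     public static final VarHandle SPEC_HANDLE_FINAL =
>>>>>             MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>>
>>>>>     @Benchmark
>>>>>     @BenchmarkMode(Mode.AverageTime)
>>>>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>     public void genericHandleBenchmark()
>>>>>     {
>>>>>         // The offset is passed as the int literal 0, but the coordinate
>>>>>         // type is long, so this is not an exact-type call.
>>>>>         GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>     }
>>>>>
>>>>>     @Benchmark
>>>>>     @BenchmarkMode(Mode.AverageTime)
>>>>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>     public void specHandleBenchmark()
>>>>>     {
>>>>>         SPEC_HANDLE.set(5);
>>>>>     }
>>>>>
>>>>>     @Benchmark
>>>>>     @BenchmarkMode(Mode.AverageTime)
>>>>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>>     public void specFinalHandleBenchmark()
>>>>>     {
>>>>>         SPEC_HANDLE_FINAL.set(5);
>>>>>     }
>>>>> }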
>>>>>
>>>>>
>>>>> Sort of off-topic, but... I don't remember anyone saying previously
>>>>> that insertCoordinates would make that big a difference (or any at
>>>>> all!), so it's surprising to me. I was expecting a performance
>>>>> decrease due to the handle no longer being static final. Could javac
>>>>> perhaps optimize this, so that for any call like:
>>>>>
>>>>>
>>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>>
>>>>>
>>>>> an optimized VarHandle equivalent to SPEC_HANDLE is created at
>>>>> compile time and used there instead?
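`insertCoordinates` binds some coordinates of a var handle once, up front, so each call only supplies what is left. A rough analogue that needs no incubator module, assuming Java 9+: view the `SET` access mode of an array element handle as a `MethodHandle` via `VarHandle.toMethodHandle`, then bind the array and index with `MethodHandles.insertArguments`. This is a stand-in chosen for illustration, not how `insertCoordinates` is implemented; the class and field names are hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

public class BoundCoordinatesDemo {
    // Generic handle: every call passes (array, index, value).
    static final VarHandle GENERIC = MethodHandles.arrayElementVarHandle(int[].class);

    static final int[] TARGET = new int[1];

    // SET viewed as a method handle of type (int[], int, int)void, with the
    // array and index bound up front: the result has type (int)void.
    static final MethodHandle BOUND_SETTER = MethodHandles.insertArguments(
            GENERIC.toMethodHandle(VarHandle.AccessMode.SET), 0, TARGET, 0);

    public static void main(String[] args) throws Throwable {
        GENERIC.set(TARGET, 0, 5);          // generic call: all coordinates supplied
        System.out.println(TARGET[0]);      // prints 5

        BOUND_SETTER.invokeExact(7);        // only the value is supplied
        System.out.println(TARGET[0]);      // prints 7

        System.out.println(BOUND_SETTER.type().equals(
                MethodType.methodType(void.class, int.class))); // prints true
    }
}
```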
>>>>>
>>>>>
>>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>>> (resending since mailing lists were down yesterday - I apologize if
>>>>>> this results in duplicates).
>>>>>>
>>>>>> Hi,
>>>>>> A few days ago some VM enhancements were integrated [1, 2], so it is
>>>>>> time to take another look at where we are.
>>>>>>
>>>>>> I put together a branch which removes all workarounds (both for long
>>>>>> loops and for alignment checks):
>>>>>>
>>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>>
>>>>>>
>>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>>> difference is like - here's a visual report:
>>>>>>
>>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>>
>>>>>>
>>>>>>
>>>>>> Overall, I think the numbers are solid. The branch without workarounds
>>>>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>>>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>>>>> has identified a possible fix for it). In some cases (parallel
>>>>>> tests) we see quite a big jump forward.
>>>>>>
>>>>>> I think it's hard to say how these results will translate to the real
>>>>>> world. My gut feeling is that the simpler bound-checking logic will
>>>>>> almost invariably result in performance improvements with more
>>>>>> complex code patterns, despite what synthetic benchmarks might say
>>>>>> (the current logic in mainline is fragile, as it has to guard against
>>>>>> integer overflow, which in turn sometimes kills BCE optimizations).
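The thread does not show mainline's bound-checking code, so as a generic illustration only: a check written as `offset + size <= length` can overflow and wrongly pass, which is the kind of hazard that forces range checks to be phrased defensively. The sketch compares that naive form with a rearranged form and with `Objects.checkFromIndexSize` (Java 9+). Class and method names are illustrative.

```java
import java.util.Objects;

public class RangeCheckOverflow {
    // Naive: offset + size can wrap around to a negative int and pass.
    static boolean naiveInBounds(int offset, int size, int length) {
        return offset >= 0 && size >= 0 && offset + size <= length;
    }

    // Rearranged: with size >= 0 and length >= 0, length - size cannot overflow.
    static boolean safeInBounds(int offset, int size, int length) {
        return offset >= 0 && size >= 0 && length >= 0 && offset <= length - size;
    }

    // JDK helper performing an overflow-safe check; throws on failure.
    static boolean jdkInBounds(int offset, int size, int length) {
        try {
            Objects.checkFromIndexSize(offset, size, length);
            return true;
        } catch (IndexOutOfBoundsException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        int offset = Integer.MAX_VALUE, size = 1, length = 10;
        System.out.println(naiveInBounds(offset, size, length)); // prints true (wrong)
        System.out.println(safeInBounds(offset, size, length));  // prints false
        System.out.println(jdkInBounds(offset, size, length));   // prints false
    }
}
```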
>>>>>>
>>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>>
>>>>>> If you have a project that works against the Java 18 API, it would be
>>>>>> very helpful for us if you could try it on the above branch and
>>>>>> report back. This will help us make a more informed decision.
>>>>>>
>>>>>> Cheers
>>>>>> Maurizio
>>>>>>
>>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>>
>>>>>>