status of VM long loop optimizations - call for action

Sat Dec 11 13:53:55 UTC 2021

Hi Maurizio,

Checked against JExtract branch 2617fbfa3050913d34906f87027b8be8f10e53a9

Project: https://github.com/rsmogura/panama-io

Benchmark                          Mode  Cnt        Score        Error  Units
SocketReadJdk.teatRead4k          thrpt    5   947424.435 ±  74922.610  ops/s
SocketReadJdk.testRead16b         thrpt    5  1823338.685 ±  33626.860  ops/s
SocketReadJdk.testRead8bOffset    thrpt    5  1817956.804 ±  25456.785  ops/s
SocketReadPosix.teatRead4k        thrpt    5  1205470.257 ± 548343.499  ops/s
SocketReadPosix.testRead16b       thrpt    5  2710119.664 ± 227053.749  ops/s
SocketReadPosix.testRead8bOffset  thrpt    5  2968281.197 ± 216628.917  ops/s

Numbers look amazing - I have to check that it still does what it's
intended to do (so I need to write some integration tests).

Kind regards,

Rado

On 10.12.2021 23:33, Remi Forax wrote:
> Hi Ty,
> there is a simple trick to be sure to get the best performance.
>
> When you create the VarHandle, call withInvokeExactBehavior [1] on it,
> the returned VarHandle will throw an error at runtime instead of trying to convert arguments.
>
> Rémi
>
> [1] https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/lang/invoke/VarHandle.html#withInvokeExactBehavior()
>
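To make the withInvokeExactBehavior suggestion concrete, here is a minimal sketch. It uses a byte-array view VarHandle from java.lang.invoke rather than a memory segment handle, so it runs without the incubating foreign API; the class and method names are made up for illustration. The handle's value type is long, so passing the int literal 5 is an inexact call: the default handle converts it, while the exact one throws WrongMethodTypeException.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;
import java.nio.ByteOrder;

// Hypothetical demo class; not taken from the benchmark in this thread.
final class ExactHandleSketch {
    // Views a byte[] as longs: coordinates are (byte[] array, int byteIndex), value type is long.
    static final VarHandle LENIENT =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());
    // Same handle, but call sites must match the access mode type exactly.
    static final VarHandle EXACT = LENIENT.withInvokeExactBehavior();

    // The default handle accepts an int value and widens it to long.
    static boolean lenientAcceptsInt() {
        byte[] bytes = new byte[8];
        LENIENT.set(bytes, 0, 5);
        return (long) LENIENT.get(bytes, 0) == 5L;
    }

    // The exact handle rejects the same call instead of converting the argument.
    static boolean exactRejectsInt() {
        byte[] bytes = new byte[8];
        try {
            EXACT.set(bytes, 0, 5);
            return false;
        } catch (WrongMethodTypeException expected) {
            return true;
        }
    }

    // With a long value the call site matches and the exact handle works.
    static boolean exactAcceptsLong() {
        byte[] bytes = new byte[8];
        EXACT.set(bytes, 0, 5L);
        return (long) EXACT.get(bytes, 0) == 5L;
    }

    public static void main(String[] args) {
        System.out.println("lenient accepts int: " + lenientAcceptsInt());
        System.out.println("exact rejects int:   " + exactRejectsInt());
        System.out.println("exact accepts long:  " + exactAcceptsLong());
    }
}
```

Turning the exact behavior on while developing makes every silently converted argument fail loudly, which is an easy way to find call sites like the int offset mentioned further down the thread.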
> ----- Original Message -----
>> From: "Ty Young" <youngty1997 at gmail.com>
>> To: "Maurizio Cimadamore" <maurizio.cimadamore at oracle.com>, "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
>> Sent: Friday, December 10, 2021 11:18:45 PM
>> Subject: Re: status of VM long loop optimizations - call for action
>> Yeah, I forgot that. Apologies.
>>
>>
>> On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
>>> Hi,
>>> I don't think the 1ns difference is real - the error in the second
>>> run is larger than that, so it's in the noise.
>>>
>>> And, since there's no loop, I don't think this specific kind of
>>> benchmark should be affected in any way by the VM improvements. What
>>> the VM can help with is to remove bound checks when you keep accessing
>>> a segment in a loop, as C2 is now able to correctly apply an
>>> optimization called "bound check elimination" or BCE. This
>>> optimization is routinely applied on Java array access, but it used to
>>> fail for memory segments because the bound of a memory segment is
>>> stored in a long variable, not an int.
>>>
>>> That said, note that you are passing inexact arguments to the var
>>> handle (e.g. you are passing an int offset instead of a long one; try
>>> to use "0L" instead of "0").
>>>
>>> Maurizio
>>>
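For readers unfamiliar with the loop shape under discussion, a rough model follows. LongBoundLoopSketch is a hypothetical stand-in for a segment, not the real MemorySegment implementation: it keeps its bound in a long and checks every access against it, which is the pattern the VM changes are meant to optimize. Whether C2 actually hoists the check out of the loop depends on the JDK build; the code only shows the shape.

```java
import java.util.Objects;

// Hypothetical model of "a bound stored in a long"; names are illustrative only.
final class LongBoundLoopSketch {
    private final int[] data;
    private final long length; // bound kept as a long, as described for memory segments

    LongBoundLoopSketch(int[] data) {
        this.data = data;
        this.length = data.length;
    }

    // Every access is guarded by a long-based bound check.
    int get(long index) {
        Objects.checkIndex(index, length); // throws IndexOutOfBoundsException when out of range
        return data[(int) index];
    }

    // A counted loop with a long induction variable: the per-iteration check
    // is what bound check elimination would try to remove.
    long sum() {
        long total = 0;
        for (long i = 0; i < length; i++) {
            total += get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(new LongBoundLoopSketch(new int[] {1, 2, 3}).sum());
    }
}
```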
>>>
>>> On 10/12/2021 21:34, Ty Young wrote:
>>>> A simple write benchmark I had already made for specialized
>>>> VarHandles (AKA insertCoordinates) seems to be consistently about 1ns
>>>> faster, so I guess these changes helped a bit?
>>>>
>>>>
>>>> Before:
>>>>
>>>>
>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>>>
>>>>
>>>> After:
>>>>
>>>>
>>>> Benchmark                                    Mode  Cnt   Score   Error  Units
>>>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>>>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>>>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>>>
>>>>
>>>> Benchmark:
>>>>
>>>>
>>>>      public static final MemorySegment SEGMENT =
>>>>          MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>>>              ResourceScope.newSharedScope());
>>>>
>>>>      public static final VarHandle GENERIC_HANDLE =
>>>>          MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>>>
>>>>      public static VarHandle SPEC_HANDLE =
>>>>          MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>
>>>>      public static final VarHandle SPEC_HANDLE_FINAL =
>>>>          MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>>>
>>>>      @Benchmark
>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>      public void genericHandleBenchmark()
>>>>      {
>>>>          GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>      }
>>>>
>>>>      @Benchmark
>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>      public void specHandleBenchmark()
>>>>      {
>>>>          SPEC_HANDLE.set(5);
>>>>      }
>>>>
>>>>      @Benchmark
>>>>      @BenchmarkMode(Mode.AverageTime)
>>>>      @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>>>      public void specFinalHandleBenchmark()
>>>>      {
>>>>          SPEC_HANDLE_FINAL.set(5);
>>>>      }
>>>>
>>>>
>>>> Sort of off-topic but... I don't remember anyone saying previously
>>>> that insertCoordinates would make that big of a difference (or any at
>>>> all!), so it's surprising to me. I was expecting a performance
>>>> decrease due to the handle no longer being static-final. Could javac
>>>> maybe optimize this so that, for any call like:
>>>>
>>>>
>>>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>>>
>>>>
>>>> an optimized VarHandle equivalent to SPEC_HANDLE is created at compile
>>>> time and used there instead?
>>>>
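As an illustration of what binding the leading coordinates does, here is a sketch using only java.lang.invoke, assuming an int[] in place of the segment. It swaps in MethodHandles.insertArguments on a method handle obtained via VarHandle.toMethodHandle, which is an analogous technique and not the MemoryHandles.insertCoordinates call from the benchmark; all names are hypothetical.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical demo class; an int[] plays the role of the memory segment.
final class BoundSetterSketch {
    static final int[] TARGET = new int[1];

    // Generic handle: coordinates are (int[] array, int index), value type is int.
    static final VarHandle GENERIC = MethodHandles.arrayElementVarHandle(int[].class);

    // Setter view of the handle, with type (int[], int, int)void.
    static final MethodHandle SETTER = GENERIC.toMethodHandle(VarHandle.AccessMode.SET);

    // Bind the array and the index up front, leaving a handle of type (int)void.
    static final MethodHandle BOUND_SETTER =
            MethodHandles.insertArguments(SETTER, 0, TARGET, 0);

    // Writes through the pre-bound handle and returns what ended up in the array.
    static int writeThroughBound(int value) {
        try {
            BOUND_SETTER.invokeExact(value);
        } catch (Throwable t) {
            throw new IllegalStateException(t);
        }
        return TARGET[0];
    }

    public static void main(String[] args) {
        System.out.println(writeThroughBound(5));
    }
}
```

The bound handle is held in a static final field here because the benchmark numbers above suggest that is where the large difference comes from; the sketch itself does not measure anything.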
>>>>
>>>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>>>> (resending since mailing lists were down yesterday - I apologize if
>>>>> this results in duplicates).
>>>>>
>>>>> Hi,
>>>>> a few days ago some VM enhancements were integrated [1, 2], so it is
>>>>> time to take a look again at where we are.
>>>>>
>>>>> I put together a branch which removes all workarounds (both for long
>>>>> loops and for alignment checks):
>>>>>
>>>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>>>
>>>>> I also ran memory access benchmarks before/after, to see what the
>>>>> difference is like - here's a visual report:
>>>>>
>>>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>>>
>>>>>
>>>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>>>> has identified a possible fix for it). In some cases (parallel
>>>>> tests) we see quite a big jump forward.
>>>>>
>>>>> I think it's hard to say how these results will translate to the real
>>>>> world - my gut feeling is that the simpler bound checking logic will
>>>>> almost invariably result in performance improvements with more
>>>>> complex code patterns, despite what synthetic benchmarks might say
>>>>> (the current logic in mainline is fragile as it has to guard against
>>>>> integer overflow, which in turn sometimes kills BCE optimizations).
>>>>>
>>>>> So I'd be inclined to integrate these changes in 18.
>>>>>
>>>>> If you have a project that works against the Java 18 API, it would be
>>>>> very helpful for us if you could try it on the above branch and
>>>>> report back. This will help us make a more informed decision.
>>>>>
>>>>> Cheers
>>>>> Maurizio
>>>>>
>>>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>>>
>>>>>