status of VM long loop optimizations - call for action

Fri Dec 10 22:18:45 UTC 2021

Yeah, I forgot that. Apologies.

On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
> Hi,
> I don't think the 1ns difference is real - the error in the second 
> run is higher than that, so it's in the noise.
>
> And, since there's no loop, I don't think this specific kind of 
> benchmark should be affected in any way by the VM improvements. What 
> the VM can help with is to remove bound checks when you keep accessing 
> a segment in a loop, as C2 is now able to correctly apply an 
> optimization called "bound check elimination" or BCE. This 
> optimization is routinely applied on Java array access, but it used to 
> fail for memory segments because the bound of a memory segment is 
> stored in a long variable, not an int.
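As a rough illustration of the two loop shapes described above (this is not the actual MemorySegment implementation): the first method is the familiar int-indexed array loop, and the second mimics segment-style access with a long induction variable and a long bound, checked through Objects.checkIndex(long, long), which is available since Java 16. The class and method names are made up for this sketch, and the explicit check only stands in for a segment's own bounds check.

```java
import java.util.Objects;

final class LongLoopSketch {

    // int induction variable, int bound: the classic array-access shape
    // for which the JIT routinely eliminates per-element bounds checks.
    static long sumIntLoop(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // long induction variable, long bound: roughly the shape of a loop over
    // a memory segment, whose size is a long. The explicit checkIndex call
    // is an assumption of this sketch, standing in for the segment's check.
    static long sumLongLoop(int[] data) {
        long length = data.length;
        long sum = 0;
        for (long i = 0; i < length; i++) {
            long checked = Objects.checkIndex(i, length);
            sum += data[(int) checked];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sumIntLoop(data));
        System.out.println(sumLongLoop(data));
    }
}
```

Both methods compute the same result; the only point is that the second loop carries a long bound, which is the case the VM changes are meant to handle.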
>
> That said, note that you are passing inexact arguments to the var 
> handle (e.g. you are passing an int offset instead of a long one; try 
> to use "0L" instead of "0").
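To make the exact vs. inexact distinction concrete without depending on the incubating foreign-memory API, here is a sketch using a plain byte-array view VarHandle whose value type is long. By default an int argument is silently converted to long at the call site; with withInvokeExactBehavior() (Java 16+) the same mismatch throws instead, which can be a handy way to catch a stray "0" where "0L" was meant. The class and method names are hypothetical.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;
import java.nio.ByteOrder;

final class ExactInvokeSketch {

    // Views a byte[] as longs: coordinates are (byte[], int), value type is long.
    static final VarHandle LONG_VIEW =
            MethodHandles.byteArrayViewVarHandle(long[].class, ByteOrder.nativeOrder());

    // Same handle, but call-site types must match the handle's types exactly.
    static final VarHandle LONG_VIEW_EXACT = LONG_VIEW.withInvokeExactBehavior();

    // Default behavior: the int literal 5 is converted to long for us.
    static long inexactCallAccepted() {
        byte[] buf = new byte[8];
        LONG_VIEW.set(buf, 0, 5);
        return (long) LONG_VIEW.get(buf, 0);
    }

    // Exact behavior: passing an int where a long is expected is rejected.
    static boolean inexactCallRejected() {
        byte[] buf = new byte[8];
        try {
            LONG_VIEW_EXACT.set(buf, 0, 5);
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }

    // Exact behavior with matching types: 5L is a long, so the call goes through.
    static long exactCallRoundTrip() {
        byte[] buf = new byte[8];
        LONG_VIEW_EXACT.set(buf, 0, 5L);
        return (long) LONG_VIEW_EXACT.get(buf, 0);
    }

    public static void main(String[] args) {
        System.out.println(inexactCallAccepted());
        System.out.println(inexactCallRejected());
        System.out.println(exactCallRoundTrip());
    }
}
```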
>
> Maurizio
>
>
> On 10/12/2021 21:34, Ty Young wrote:
>> A simple write benchmark I had already made for specialized 
>> VarHandles (AKA insertCoordinates) seems to get about 1ns consistently 
>> faster, so I guess these changes helped a bit?
>>
>>
>> Before:
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>
>>
>> After:
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>
>>
>> Benchmark:
>>
>>
>>     public static final MemorySegment SEGMENT =
>>         MemorySegment.allocateNative(ValueLayout.JAVA_INT, ResourceScope.newSharedScope());
>>
>>     public static final VarHandle GENERIC_HANDLE =
>>         MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>
>>     public static VarHandle SPEC_HANDLE =
>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>
>>     public static final VarHandle SPEC_HANDLE_FINAL =
>>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>
>>     @Benchmark
>>     @BenchmarkMode(Mode.AverageTime)
>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>     public void genericHandleBenchmark()
>>     {
>>         GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>     }
>>
>>     @Benchmark
>>     @BenchmarkMode(Mode.AverageTime)
>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>     public void specHandleBenchmark()
>>     {
>>         SPEC_HANDLE.set(5);
>>     }
>>
>>     @Benchmark
>>     @BenchmarkMode(Mode.AverageTime)
>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>     public void specFinalHandleBenchmark()
>>     {
>>         SPEC_HANDLE_FINAL.set(5);
>>     }
>>
>>
>> Sort of off-topic but... I don't remember anyone saying previously 
>> that insertCoordinates would give that big of a difference (or any at 
>> all!) so it's surprising to me. I was expecting a performance 
>> decrease due to the handle no longer being static-final. Can javac 
>> maybe optimize this so that any case where:
>>
>>
>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>
>>
>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created 
>> at compile time and inserted there instead?
>>
>>
>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>> (resending since mailing lists were down yesterday - I apologize if 
>>> this results in duplicates).
>>>
>>> Hi,
>>> a few days ago some VM enhancements were integrated [1, 2], so it is 
>>> time to take a look again at where we are.
>>>
>>> I put together a branch which removes all workarounds (both for long 
>>> loops and for alignment checks):
>>>
>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>
>>> I also ran memory access benchmarks before/after, to see what the 
>>> difference is like - here's a visual report:
>>>
>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7 
>>>
>>>
>>> Overall, I think the numbers are solid. The branch w/o workarounds 
>>> keeps up with mainline in basically all cases but one (UnrolledAccess 
>>> - this code pattern needs more work in the VM, but Roland Westrelin 
>>> has identified a possible fix for it). In some cases (parallel 
>>> tests) we see quite a big jump forward.
>>>
>>> I think it's hard to say how these results will translate to the real 
>>> world - my gut feeling is that the simpler bound checking logic will 
>>> almost invariably result in performance improvements with more 
>>> complex code patterns, despite what synthetic benchmarks might say 
>>> (the current logic in mainline is fragile as it has to guard against 
>>> integer overflow, which in turn sometimes kills BCE optimizations).
>>>
>>> So I'd be inclined to integrate these changes in 18.
>>>
>>> If you have a project that works against the Java 18 API, it would be 
>>> very helpful for us if you could try it on the above branch and 
>>> report back. This will help us make a more informed decision.
>>>
>>> Cheers
>>> Maurizio
>>>
>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>
>>>
>>>