status of VM long loop optimizations - call for action

Fri Dec 10 22:06:32 UTC 2021

Hi,
I don't think the 1ns difference is real: the error reported for the 
second run is larger than that, so the difference is in the noise.

And, since there's no loop, I don't think this specific kind of 
benchmark should be affected in any way by the VM improvements. What the 
VM can help with is removing bounds checks when you keep accessing a 
segment in a loop, as C2 is now able to correctly apply an optimization 
called "bounds check elimination", or BCE. This optimization is routinely 
applied to Java array accesses, but it used to fail for memory segments 
because the bound of a memory segment is stored in a long variable, not 
an int.
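
To make the loop shape concrete, here is a minimal sketch in plain Java. 
It does not use the foreign memory API: SegmentModel is a hypothetical 
stand-in that only mimics the relevant property, namely a bound kept in a 
long and checked on every access. Whether C2 actually hoists the check in 
sumSegment depends on the JDK build you run it on.

```java
// Hypothetical stand-in for a memory segment: the bound is a long,
// and every access is checked against it.
final class SegmentModel {
    private final int[] storage;
    private final long length;

    SegmentModel(int elements) {
        this.storage = new int[elements];
        this.length = elements;
    }

    long length() {
        return length;
    }

    int get(long index) {
        // Per-access bounds check against a long bound.
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return storage[(int) index];
    }

    void set(long index, int value) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        storage[(int) index] = value;
    }
}

final class BoundsCheckSketch {
    // Classic int-indexed array loop: the case where BCE is routinely applied.
    static long sumArray(int[] values) {
        long sum = 0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Long-indexed loop: the loop condition already implies the check in get(),
    // so the check is redundant if the JIT can reason about long induction variables.
    static long sumSegment(SegmentModel segment) {
        long sum = 0;
        for (long i = 0; i < segment.length(); i++) {
            sum += segment.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 3, 4};
        SegmentModel segment = new SegmentModel(values.length);
        for (int i = 0; i < values.length; i++) {
            segment.set(i, values[i]);
        }
        System.out.println(sumArray(values));
        System.out.println(sumSegment(segment));
    }
}
```

A single access outside a loop, as in the benchmark below, gives the 
optimization nothing to hoist, which is why no change is expected there.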

That said, note that you are passing inexact arguments to the var handle 
(e.g. you are passing an int offset instead of a long one; try to use 
"0L" instead of "0").
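
As a rough illustration of exact versus inexact invocation, here is a 
sketch that uses only java.lang.invoke, not the memory access API. The 
array element handle and the class name are stand-ins chosen for the 
example; the point is only that a var handle has one exact type per 
access mode, and that a call whose static argument types differ from it 
still works but goes through an argument conversion.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.VarHandle;

final class ExactInvocationSketch {
    // Stand-in handle: accesses elements of a long[] (coordinates: long[], int).
    static final VarHandle LONG_ELEMENTS =
            MethodHandles.arrayElementVarHandle(long[].class);

    // The exact type a set() call site must have to avoid any adaptation.
    static MethodType expectedSetType() {
        return LONG_ELEMENTS.accessModeType(VarHandle.AccessMode.SET);
    }

    // Inexact: the value is an int literal, so the call site type is
    // (long[], int, int)void and the int must be widened to long.
    static long writeInexact(long[] array) {
        LONG_ELEMENTS.set(array, 0, 5);
        return array[0];
    }

    // Exact: the value is a long literal, so the call site type matches
    // the handle's own type for set().
    static long writeExact(long[] array) {
        LONG_ELEMENTS.set(array, 0, 5L);
        return array[0];
    }

    public static void main(String[] args) {
        System.out.println(expectedSetType());
        System.out.println(writeInexact(new long[1]));
        System.out.println(writeExact(new long[1]));
    }
}
```

Both calls store the same value; the difference is only in how much 
adaptation sits between the call site and the actual access.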

Maurizio

On 10/12/2021 21:34, Ty Young wrote:
> A simple write benchmark I had already made for specialized 
> VarHandles (AKA insertCoordinates) seems to be consistently about 1ns 
> faster, so I guess these changes helped a bit?
>
>
> Before:
>
>
> Benchmark                                    Mode  Cnt   Score    Error  Units
> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ±  0.145  ns/op
> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ±  0.201  ns/op
> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ±  1.324  ns/op
>
>
> After:
>
>
> Benchmark                                    Mode  Cnt   Score    Error  Units
> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ±  1.466  ns/op
> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ±  0.156  ns/op
> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ±  1.712  ns/op
>
>
> Benchmark:
>
>
>     public static final MemorySegment SEGMENT =
>         MemorySegment.allocateNative(ValueLayout.JAVA_INT, ResourceScope.newSharedScope());
>
>     public static final VarHandle GENERIC_HANDLE =
>         MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>
>     public static VarHandle SPEC_HANDLE =
>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>
>     public static final VarHandle SPEC_HANDLE_FINAL =
>         MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>
>     @Benchmark
>     @BenchmarkMode(Mode.AverageTime)
>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>     public void genericHandleBenchmark()
>     {
>         GENERIC_HANDLE.set(SEGMENT, 0, 5);
>     }
>
>     @Benchmark
>     @BenchmarkMode(Mode.AverageTime)
>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>     public void specHandleBenchmark()
>     {
>         SPEC_HANDLE.set(5);
>     }
>
>     @Benchmark
>     @BenchmarkMode(Mode.AverageTime)
>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>     public void specFinalHandleBenchmark()
>     {
>         SPEC_HANDLE_FINAL.set(5);
>     }
>
>
> Sort of off-topic but... I don't remember anyone saying previously 
> that insertCoordinates would make that big of a difference (or any at 
> all!), so it's surprising to me. I was expecting a performance decrease 
> due to the handle no longer being static-final. Can javac maybe 
> optimize this so that, wherever a call like:
>
>
> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>
>
> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created 
> at compile time and used there instead?
>
>
> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>> (resending since mailing lists were down yesterday - I apologize if 
>> this results in duplicates).
>>
>> Hi,
>> a few days ago some VM enhancements were integrated [1, 2], so it is 
>> time to take a look again at where we are.
>>
>> I put together a branch which removes all workarounds (both for long 
>> loops and for alignment checks):
>>
>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>
>> I also ran memory access benchmarks before/after, to see what the 
>> difference is like - here's a visual report:
>>
>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7 
>>
>>
>> Overall, I think the numbers are solid. The branch w/o workarounds 
>> keeps up with mainline in basically all cases but one (UnrolledAccess 
>> - this code pattern needs more work in the VM, but Roland Westrelin 
>> has identified a possible fix for it). In some cases (parallel tests) 
>> we see quite a big jump forward.
>>
>> I think it's hard to say how these results will translate to the real 
>> world - my gut feeling is that the simpler bounds checking logic will 
>> almost invariably result in performance improvements with more 
>> complex code patterns, despite what synthetic benchmarks might say 
>> (the current logic in mainline is fragile as it has to guard against 
>> integer overflow, which in turn sometimes kills BCE optimizations).
>>
>> So I'd be inclined to integrate these changes in 18.
>>
>> If you have a project that works against the Java 18 API, it would be 
>> very helpful for us if you could try it on the above branch and 
>> report back. This will help us make a more informed decision.
>>
>> Cheers
>> Maurizio
>>
>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>
>>
>>