status of VM long loop optimizations - call for action
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Fri Dec 10 22:06:32 UTC 2021
Hi,
I don't think the 1ns difference is real - if you look at the error, in
the second run it is higher than that, so the difference is in the noise.
And, since there's no loop, I don't think this specific kind of
benchmark should be affected in any way by the VM improvements. What the
VM can help with is to remove bound checks when you keep accessing a
segment in a loop, as C2 is now able to correctly apply an optimization
called "bound check elimination" or BCE. This optimization is routinely
applied on Java array access, but it used to fail for memory segments
because the bound of a memory segment is stored in a long variable, not
an int.
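
To make the pattern concrete, here is a rough sketch of the kind of loop
where BCE matters. The class below is a hypothetical stand-in for a
segment (it is not the real MemorySegment implementation): the bound is
kept in a long and every access is checked against it, here using
Objects.checkIndex(long, long), which is available from Java 16. Whether
C2 actually hoists the check out of the loop depends on the JDK build.

```java
import java.util.Objects;

// Hypothetical stand-in for a memory segment: the bound is a long, not an int.
// This only illustrates the access pattern; it is not how MemorySegment is implemented.
final class BoundsCheckSketch {
    private final byte[] data;  // backing storage (assumption, for illustration only)
    private final long length;  // bound stored in a long, as for a memory segment

    BoundsCheckSketch(int size) {
        this.data = new byte[size];
        this.length = size;
    }

    // Each read is checked against the long bound.
    byte get(long offset) {
        Objects.checkIndex(offset, length); // throws IndexOutOfBoundsException if out of range
        return data[(int) offset];
    }

    // Each write is checked against the long bound.
    void set(long offset, byte value) {
        Objects.checkIndex(offset, length);
        data[(int) offset] = value;
    }

    // Loop with a long induction variable: the per-iteration check in get()
    // is redundant given the loop condition, which is what BCE can exploit.
    static long sum(BoundsCheckSketch s) {
        long total = 0;
        for (long i = 0; i < s.length; i++) {
            total += s.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        BoundsCheckSketch s = new BoundsCheckSketch(1024);
        for (long i = 0; i < 1024; i++) {
            s.set(i, (byte) 1);
        }
        System.out.println(sum(s));
    }
}
```

A single access outside a loop, as in the benchmark quoted below, gives
the JIT nothing to hoist, which is why I would not expect it to change.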
That said, note that you are passing inexact arguments to the var handle
(e.g. you are passing an int offset instead of a long one; try to use
"0L" instead of "0").
Maurizio
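
P.S. On the inexact-arguments point, a small sketch of what "inexact"
means for a var handle call. To keep it runnable without the incubator
module, it uses a plain long[] element var handle as a stand-in for the
memory segment one (that substitution is my assumption; the class and
method names are made up for the example). withInvokeExactBehavior() is
available from Java 16 and makes the mismatch visible as an exception
rather than a silent conversion.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

// Stand-in example: a long[] element var handle instead of a memory segment var handle.
final class ExactVarHandleSketch {
    // Coordinates are (long[], int); the value type is long.
    static final VarHandle ELEMS = MethodHandles.arrayElementVarHandle(long[].class);
    // Same handle, but calls must match the access mode type exactly.
    static final VarHandle ELEMS_EXACT = ELEMS.withInvokeExactBehavior();

    // Passing an int where a long is expected is accepted by default:
    // the call is adapted (int widened to long) before the access.
    static long lenientSet() {
        long[] arr = new long[1];
        ELEMS.set(arr, 0, 5);
        return arr[0];
    }

    // With exact behavior the same call is rejected at run time.
    static boolean inexactSetIsRejected() {
        long[] arr = new long[1];
        try {
            ELEMS_EXACT.set(arr, 0, 5); // int literal, long expected
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }

    // Using a long literal makes the call exact.
    static long exactSet() {
        long[] arr = new long[1];
        ELEMS_EXACT.set(arr, 0, 5L);
        return arr[0];
    }

    public static void main(String[] args) {
        System.out.println(lenientSet());
        System.out.println(inexactSetIsRejected());
        System.out.println(exactSet());
    }
}
```

In the benchmark below the analogous fix is to pass the offset as 0L, so
that the call site matches the (MemorySegment, long, int) shape of the
handle.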
On 10/12/2021 21:34, Ty Young wrote:
> A simple write benchmark I had already made for specialized
> VarHandles (AKA insertCoordinates) seems to get consistently faster by
> about 1ns, so I guess these changes helped a bit?
>
>
> Before:
>
>
> Benchmark                                    Mode  Cnt   Score   Error  Units
> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>
>
> After:
>
>
> Benchmark                                    Mode  Cnt   Score   Error  Units
> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>
>
> Benchmark:
>
>
> public static final MemorySegment SEGMENT =
>     MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>         ResourceScope.newSharedScope());
>
> public static final VarHandle GENERIC_HANDLE =
>     MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>
> public static VarHandle SPEC_HANDLE =
>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>
> public static final VarHandle SPEC_HANDLE_FINAL =
>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>
> @Benchmark
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public void genericHandleBenchmark()
> {
>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
> }
>
> @Benchmark
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public void specHandleBenchmark()
> {
>     SPEC_HANDLE.set(5);
> }
>
> @Benchmark
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> public void specFinalHandleBenchmark()
> {
>     SPEC_HANDLE_FINAL.set(5);
> }
>
>
> Sort of off-topic but... I don't remember anyone saying previously
> that insertCoordinates would give that big of a difference (or any at
> all!) so it's surprising to me. I was expecting a performance decrease
> due to the handle no longer being static-final. Can javac maybe
> optimize this so that, wherever a call such as:
>
>
> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>
>
> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created
> at compile time and used there instead?
>
>
> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>> (resending since mailing lists were down yesterday - I apologize if
>> this results in duplicates).
>>
>> Hi,
>> a few days ago some VM enhancements were integrated [1, 2], so it is
>> time to take another look at where we are.
>>
>> I put together a branch which removes all workarounds (both for long
>> loops and for alignment checks):
>>
>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>
>> I also ran memory access benchmarks before/after, to see what the
>> difference is like - here's a visual report:
>>
>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>
>>
>> Overall, I think the numbers are solid. The branch w/o workarounds
>> keeps up with mainline in basically all cases but one (UnrolledAccess
>> - this code pattern needs more work in the VM, but Roland Westrelin
>> has identified a possible fix for it). In some cases (parallel tests)
>> we see quite a big jump forward.
>>
>> I think it's hard to say how these results will translate to the real
>> world - my gut feeling is that the simpler bound checking logic will
>> almost invariably result in performance improvements with more
>> complex code patterns, despite what synthetic benchmarks might say
>> (the current logic in mainline is fragile as it has to guard against
>> integer overflow, which in turn sometimes kills BCE optimizations).
>>
>> So I'd be inclined to integrate these changes in 18.
>>
>> If you have a project that works against the Java 18 API, it would be
>> very helpful for us if you could try it on the above branch and
>> report back. This will help us make a more informed decision.
>>
>> Cheers
>> Maurizio
>>
>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>
>>
>>