status of VM long loop optimizations - call for action
Ty Young
youngty1997 at gmail.com
Fri Dec 10 22:18:45 UTC 2021
Yeah, I forgot that. Apologies.
On 12/10/21 4:06 PM, Maurizio Cimadamore wrote:
> Hi,
> I don't think the 1ns difference is real - if you look, the error in
> the second run is higher than that, so it's in the noise.
>
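
As a rough illustration of the "in the noise" point, the sketch below treats each JMH Score ± Error pair from the tables quoted further down as an interval and checks whether the before/after intervals overlap. The class and method names are made up for this example, and interval overlap is only a crude heuristic, not a proper significance test.

```java
class NoiseCheck {
    /** True when [a - ea, a + ea] and [b - eb, b + eb] share at least one point. */
    static boolean overlaps(double a, double ea, double b, double eb) {
        return Math.abs(a - b) <= ea + eb;
    }

    public static void main(String[] args) {
        // Scores and errors (ns/op) copied from the before/after tables in this thread.
        System.out.println("generic:   " + overlaps(21.155, 0.145, 20.304, 1.466));
        System.out.println("specFinal: " + overlaps(0.678, 0.201, 0.652, 0.156));
        System.out.println("spec:      " + overlaps(17.323, 1.324, 17.266, 1.712));
    }
}
```

All three pairs overlap, which is consistent with reading the differences as measurement noise.
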
> And, since there's no loop, I don't think this specific kind of
> benchmark should be affected in any way by the VM improvements. What
> the VM can help with is to remove bound checks when you keep accessing
> a segment in a loop, as C2 is now able to correctly apply an
> optimization called "bound check elimination" or BCE. This
> optimization is routinely applied on Java array access, but it used to
> fail for memory segments because the bound of a memory segment is
> stored in a long variable, not an int.
>
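
To make the loop case concrete, here is a minimal sketch of the code shape being described: repeated accesses in a loop, each guarded by a check against a bound held in a long. `FakeSegment` is a hypothetical stand-in written for this example (it is not the MemorySegment API), and whether C2 actually hoists the check depends on the JDK build; the code only shows the pattern.

```java
import java.util.Objects;

class LongBoundsLoop {
    /** Hypothetical stand-in for a segment: the bound is a long, not an int. */
    static final class FakeSegment {
        private final int[] data;
        private final long length;

        FakeSegment(int size) {
            this.data = new int[size];
            this.length = size;
        }

        int get(long index) {
            // Per-access check against a long bound; this is the check
            // a JIT would want to prove redundant inside a counted loop.
            Objects.checkIndex(index, length);
            return data[(int) index];
        }

        void set(long index, int value) {
            Objects.checkIndex(index, length);
            data[(int) index] = value;
        }
    }

    static long sum(FakeSegment seg) {
        long total = 0;
        // long induction variable compared against a long bound
        for (long i = 0; i < seg.length; i++) {
            total += seg.get(i);
        }
        return total;
    }

    static FakeSegment filled(int size) {
        FakeSegment seg = new FakeSegment(size);
        for (long i = 0; i < size; i++) {
            seg.set(i, (int) i);
        }
        return seg;
    }

    public static void main(String[] args) {
        System.out.println(sum(filled(1000))); // 0 + 1 + ... + 999
    }
}
```

A benchmark with a single access and no loop, like the one quoted below, has no such check to hoist.
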
> That said, note that you are passing inexact arguments to the var
> handle (e.g. you are passing an int offset instead of a long one; try
> to use "0L" instead of "0").
>
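
The exact/inexact distinction can be tried out with the standard java.lang.invoke API alone. The sketch below is an analogue, not the benchmark from this thread: it uses an array element handle for `long[]`, so the mismatch is an int value where a long is expected, rather than an int offset. `withInvokeExactBehavior()` (available since JDK 16) makes the handle reject inexact calls instead of silently adapting them; the class and method names are invented for the example.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.lang.invoke.WrongMethodTypeException;

class ExactInvocation {
    // For long[] elements, the access mode type of set is (long[], int, long)void.
    static final VarHandle ELEMS =
            MethodHandles.arrayElementVarHandle(long[].class).withInvokeExactBehavior();

    /** Returns true if the inexact call (int literal for a long value) is rejected. */
    static boolean inexactCallRejected(long[] arr) {
        try {
            ELEMS.set(arr, 0, 5); // call site type is (long[], int, int)void
            return false;
        } catch (WrongMethodTypeException e) {
            return true;
        }
    }

    /** Exact call: the literal 5L makes the call site match the handle's type. */
    static long exactCall(long[] arr) {
        ELEMS.set(arr, 0, 5L);
        return arr[0];
    }

    public static void main(String[] args) {
        long[] arr = new long[1];
        System.out.println("inexact rejected: " + inexactCallRejected(arr));
        System.out.println("exact wrote: " + exactCall(arr));
    }
}
```

Without `withInvokeExactBehavior()`, the inexact call would still work, just through an adapted invocation path, which is why it is easy to miss in a benchmark.
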
> Maurizio
>
>
> On 10/12/2021 21:34, Ty Young wrote:
>> A simple write benchmark I had already made for specialized
>> VarHandles (AKA insertCoordinates) seems to be about 1ns faster,
>> consistently, so I guess these changes helped a bit?
>>
>>
>> Before:
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  21.155 ± 0.145  ns/op
>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.678 ± 0.201  ns/op
>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.323 ± 1.324  ns/op
>>
>>
>> After:
>>
>>
>> Benchmark                                    Mode  Cnt   Score   Error  Units
>> VarHandleBenchmark.genericHandleBenchmark    avgt    5  20.304 ± 1.466  ns/op
>> VarHandleBenchmark.specFinalHandleBenchmark  avgt    5   0.652 ± 0.156  ns/op
>> VarHandleBenchmark.specHandleBenchmark       avgt    5  17.266 ± 1.712  ns/op
>>
>>
>> Benchmark:
>>
>>
>> public static final MemorySegment SEGMENT =
>>     MemorySegment.allocateNative(ValueLayout.JAVA_INT,
>>         ResourceScope.newSharedScope());
>>
>> public static final VarHandle GENERIC_HANDLE =
>>     MemoryHandles.varHandle(ValueLayout.JAVA_INT);
>>
>> public static VarHandle SPEC_HANDLE =
>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>
>> public static final VarHandle SPEC_HANDLE_FINAL =
>>     MemoryHandles.insertCoordinates(GENERIC_HANDLE, 0, SEGMENT, 0);
>>
>> @Benchmark
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public void genericHandleBenchmark()
>> {
>>     GENERIC_HANDLE.set(SEGMENT, 0, 5);
>> }
>>
>> @Benchmark
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public void specHandleBenchmark()
>> {
>>     SPEC_HANDLE.set(5);
>> }
>>
>> @Benchmark
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> public void specFinalHandleBenchmark()
>> {
>>     SPEC_HANDLE_FINAL.set(5);
>> }
>>
>>
>> Sort of off-topic but... I don't remember anyone saying previously
>> that insertCoordinates would make that big of a difference (or any at
>> all!), so it's surprising to me. I was expecting a performance
>> decrease due to the handle no longer being static-final. Can javac
>> maybe optimize this, so that wherever a call like:
>>
>>
>> GENERIC_HANDLE.set(SEGMENT, 0, 5);
>>
>>
>> appears, an optimized VarHandle equivalent to SPEC_HANDLE is created
>> at compile time and inserted there instead?
>>
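
For readers unfamiliar with what binding the coordinates up front means, here is a rough analogue using only the standard java.lang.invoke API. It is not the incubator `MemoryHandles.insertCoordinates` call from the benchmark: it converts an array element VarHandle to a MethodHandle and pre-binds the array and index with `MethodHandles.insertArguments`, leaving a setter that takes only the value. All names here are invented for the example.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

class BoundSetter {
    static final int[] TARGET = new int[1];

    // Generic handle: set takes (int[], int, int)void.
    static final VarHandle GENERIC = MethodHandles.arrayElementVarHandle(int[].class);

    // Specialized setter: the array and index 0 are bound once, leaving (int)void.
    static final MethodHandle SPEC_SET = MethodHandles.insertArguments(
            GENERIC.toMethodHandle(VarHandle.AccessMode.SET), 0, TARGET, 0);

    /** Writes through the generic handle, passing every coordinate at the call site. */
    static int writeGeneric(int value) {
        GENERIC.set(TARGET, 0, value);
        return TARGET[0];
    }

    /** Writes through the pre-bound setter, passing only the value. */
    static int writeSpec(int value) {
        try {
            SPEC_SET.invokeExact(value);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
        return TARGET[0];
    }

    public static void main(String[] args) {
        System.out.println(writeGeneric(5));
        System.out.println(writeSpec(7));
    }
}
```

The bound handle is an object built at run time from live values such as `TARGET`, so any specialization of this kind happens in the running VM rather than in the class file.
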
>>
>> On 12/10/21 4:55 AM, Maurizio Cimadamore wrote:
>>> (resending since mailing lists were down yesterday - I apologize if
>>> this results in duplicates).
>>>
>>> Hi,
>>> a few days ago some VM enhancements were integrated [1, 2], so it is
>>> time to take a look again at where we are.
>>>
>>> I put together a branch which removes all workarounds (both for long
>>> loops and for alignment checks):
>>>
>>> https://github.com/mcimadamore/jdk/tree/long_loop_workarounds_removal
>>>
>>> I also ran memory access benchmarks before/after, to see what the
>>> difference is like - here's a visual report:
>>>
>>> https://jmh.morethan.io/?gists=dfa7075db33f7e6a2690ac80a64aa252,7f894f48460a6a0c9891cbe3158b43a7
>>>
>>>
>>> Overall, I think the numbers are solid. The branch w/o workarounds
>>> keeps up with mainline in basically all cases but one (UnrolledAccess
>>> - this code pattern needs more work in the VM, but Roland Westrelin
>>> has identified a possible fix for it). In some cases (parallel
>>> tests) we see quite a big jump forward.
>>>
>>> I think it's hard to say how these results will translate to the real
>>> world - my gut feeling is that the simpler bound checking logic will
>>> almost invariably result in performance improvements with more
>>> complex code patterns, despite what synthetic benchmarks might say
>>> (the current logic in mainline is fragile as it has to guard against
>>> integer overflow, which in turn sometimes kills BCE optimizations).
>>>
>>> So I'd be inclined to integrate these changes in 18.
>>>
>>> If you have a project that works against the Java 18 API, it would be
>>> very helpful for us if you could try it on the above branch and
>>> report back. This will help us make a more informed decision.
>>>
>>> Cheers
>>> Maurizio
>>>
>>> [1] - https://bugs.openjdk.java.net/browse/JDK-8276116
>>> [2] - https://bugs.openjdk.java.net/browse/JDK-8277850
>>>
>>>
>>>