an opencl binding - zcl/panama
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Tue Jan 28 12:40:21 UTC 2020
So, I took a better look and I have some news.
The first thing tripping the benchmark up is this:
int len = (int)(seg.byteSize() >>> 3);
If you replace it with:
int len = ((int)seg.byteSize() >>> 3);
Or, even better, with:
int len = ((int)seg.byteSize() / 8);
With either change, the segment code is already 2x faster. The reason is
that the bound check elimination code in Hotspot is _extremely_
sensitive to the opcodes being used: any long operation (e.g. a long
shift) disables that logic. Fixing this is currently our #1 priority,
and it is captured here:
https://bugs.openjdk.java.net/browse/JDK-8223051
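To make the difference between the two forms concrete, here is a minimal stand-alone sketch. The class and method names are hypothetical, and a plain long parameter stands in for seg.byteSize(). It only compares the values the two expressions produce; it does not measure anything. One caveat worth keeping in mind: narrowing to int before dividing gives the same answer only while the byte size is below 2^31.

```java
// Hypothetical demo class; byteSize stands in for seg.byteSize().
class LenDemo {
    // Long shift first, then narrow to int: the form the message reports as slow.
    static int lenLongShift(long byteSize) {
        return (int) (byteSize >>> 3);
    }

    // Narrow to int first, then use an int division.
    static int lenIntDivide(long byteSize) {
        return ((int) byteSize) / 8;
    }

    public static void main(String[] args) {
        long small = 8L * 1024;   // 8 KiB: both forms agree
        System.out.println(lenLongShift(small) + " " + lenIntDivide(small));

        long big = 1L << 32;      // 4 GiB: the early cast truncates the size
        System.out.println(lenLongShift(big) + " " + lenIntDivide(big));
    }
}
```

For segments that may exceed 2 GiB, the int form would need an explicit size check before it is safe to use.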
There is also another issue, which is half a benchmark issue, half
Hotspot. Basically, your check() methods are doing loads of get() calls,
but only very few (4) set() calls. This leads to a situation where calls
to Native.getXYZ are inlined as you would expect into the check()
method, but calls to Native.setXYZ are not (because they are cold).
This creates issues with escape analysis (C2 sees the address you
operate on 'escaping' into another call), which in turn disables loop
optimizations: these rely on the address being a loop invariant, which
C2 cannot prove here because the setter call is not inlined. Failures
of EA with calls to MemorySegment::baseAddress and
MemoryAddress::addOffset are also a known issue, captured here (**):
https://bugs.openjdk.java.net/browse/JDK-8237077
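The hot-getter / cold-setter shape described above can be sketched without the incubating API. The following uses java.nio.ByteBuffer as a stand-in, and the class and helper names are hypothetical; it is not the benchmark's Native class. It only shows the call pattern (many reads in the loop, a single write after it). Whether C2 inlines either helper depends on the VM and the profile; running with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining is one way to see what happened.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for the benchmark's accessor pattern.
class HotGetColdSet {
    // Called once per element: hot, so a likely inlining candidate.
    static long getLong(ByteBuffer bb, int index) {
        return bb.getLong(index * 8);
    }

    // Called once per check(): cold compared with the getter.
    static void setLong(ByteBuffer bb, long value) {
        bb.putLong(0, value);
    }

    // Sums every long in the buffer, then stores the sum in element 0.
    static long check(ByteBuffer bb) {
        int len = bb.capacity() / 8;
        long sum = 0;
        for (int i = 0; i < len; i++)
            sum += getLong(bb, i);
        setLong(bb, sum);
        return sum;
    }

    // Builds a direct buffer holding the values 0 .. count-1.
    static ByteBuffer filled(int count) {
        ByteBuffer bb = ByteBuffer.allocateDirect(count * 8)
                                  .order(ByteOrder.nativeOrder());
        for (int i = 0; i < count; i++)
            bb.putLong(i * 8, i);
        return bb;
    }

    public static void main(String[] args) {
        System.out.println(check(filled(1024)));
    }
}
```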
If you rewrite check(MemorySegment) as follows (which helps C2 see that
the hot loop is using an address which does not escape):
    static void check(MemorySegment seg) {
        int len = ((int)seg.byteSize() / 8);
        long sum = 0;
        MemoryAddress add1 = seg.baseAddress();
        for (int i = 0; i < len; i++)
            sum += Native.getLong(add1, i);
        MemoryAddress add2 = seg.baseAddress();
        Native.setLong(add2, sum);
    }
Then the segment version comes out on top:
0.497758726 array
0.836574479 bb stream
0.446651107 segment
0.482202441 bb index
2.767206835 bb over segment
Of course I'm not suggesting that the code you wrote doesn't make sense
- I think this shows that (a) segments have the potential to be very
fast but (b) we have some work to do on the VM side to smooth out the
performance side of things.
Maurizio
(**) longer term, Valhalla inline types will make all these issues go
away, but we need some ad-hoc mitigation in the interim
On 28/01/2020 09:24, Maurizio Cimadamore wrote:
> Found it:
>
> https://code.zedzone.space/cvs?p=zcl;a=blob;f=src/notzed.zcl.demo/classes/au/notzed/zcl/test/TestMemoryLong.java;h=55e1c5051c7bfdfabfb62c3bbbe21d1e3f90e7aa;hb=refs/heads/foreign-abi
>
>
> Interesting - you are testing multiple segments with the same
> benchmark method. I'll try to replicate something similar with our JMH
> infra.
>
> Maurizio
>
> On 28/01/2020 08:39, Maurizio Cimadamore wrote:
>> If you have the test somewhere I'd love to take a look and maybe port
>> it on top of JMH. We are looking at these performance potholes now,
>> so it is a great time to report such issues.