an opencl binding - zcl/panama

Tue Jan 28 12:40:21 UTC 2020

So, I took a better look and I have some news.

The first thing tripping the benchmark up is this:

         int len = (int)(seg.byteSize() >>> 3);

If you replace it with:

         int len = ((int)seg.byteSize() >>> 3);

Or, even better, with:

         int len = ((int)seg.byteSize() / 8);

then the segment code is already 2x faster. The reason is that the 
bounds check elimination code in Hotspot is _extremely_ sensitive to the 
opcodes being used: any long operation (e.g. a long shift) in the 
computation disables the logic. This is currently our #1 priority and is 
captured here:

https://bugs.openjdk.java.net/browse/JDK-8223051
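
To make the difference between the three forms concrete, here is a 
minimal, standalone sketch. The class and method names are made up for 
illustration and are not part of the benchmark; a plain long parameter 
stands in for seg.byteSize(). It only shows what each expression 
computes, not how C2 compiles it. Note that the forms which narrow to 
int first agree with the original only while the size fits in an int 
(below 2 GiB).

```java
// Hypothetical helper class, for illustration only.
final class LenForms {
    // Original form: long shift first, then narrow the result to int.
    static int longShiftThenCast(long byteSize) {
        return (int) (byteSize >>> 3);
    }

    // First replacement: narrow to int first, then an int shift.
    static int castThenIntShift(long byteSize) {
        return ((int) byteSize >>> 3);
    }

    // Second replacement: narrow to int first, then an int division.
    static int castThenIntDiv(long byteSize) {
        return ((int) byteSize / 8);
    }

    public static void main(String[] args) {
        long small = 1L << 20; // 1 MiB, fits in an int
        long large = 1L << 32; // 4 GiB, does not fit in an int

        // All three forms agree for the small size (131072 longs).
        System.out.println(longShiftThenCast(small) + " "
                + castThenIntShift(small) + " " + castThenIntDiv(small));

        // For the large size the narrowing cast drops the high bits,
        // so only the original form still yields 536870912.
        System.out.println(longShiftThenCast(large) + " "
                + castThenIntShift(large) + " " + castThenIntDiv(large));
    }
}
```

So the replacement is a safe rewrite only when the segment is known to 
be smaller than 2 GiB.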

There is also another issue, which is half a benchmark issue, half 
Hotspot. Basically, your check() methods are doing loads of get() calls, 
but only very few (4) set() calls. This leads to a situation where calls 
to Native.getXYZ are inlined as you would expect into the check() 
method, but calls to Native.setXYZ are not (because they are cold).

This creates issues with escape analysis: C2 sees the address on which 
you operate as 'escaping' into another call. That in turn disables loop 
optimizations, which rely on the address being a loop invariant; 
without the setter call inlined, C2 cannot prove that. Failures of EA 
with calls to MemorySegment::baseAddress and MemoryAddress::addOffset 
are also a known issue and they are captured here (**):

https://bugs.openjdk.java.net/browse/JDK-8237077

If I rewrite check(MemorySegment) as follows (which helps C2 see that 
the hot loop is using an address which does not escape):

     static void check(MemorySegment seg) {
         int len = ((int)seg.byteSize() / 8);
         long sum = 0;

         MemoryAddress add1 = seg.baseAddress();
         for (int i = 0; i < len; i++)
             sum += Native.getLong(add1, i);
         MemoryAddress add2 = seg.baseAddress();
         Native.setLong(add2, sum);
     }

Then the segment version comes out on top:

   0.497758726 array
   0.836574479 bb stream
   0.446651107 segment
   0.482202441 bb index
   2.767206835 bb over segment

Of course I'm not suggesting that the code you wrote doesn't make sense 
- I think this shows that (a) segments have the potential to be very 
fast but (b) we have some work to do on the VM side to smooth out the 
performance side of things.

Maurizio

(**) longer term, Valhalla inline types will make all these issues go 
away, but we need some ad-hoc mitigation in the interim

On 28/01/2020 09:24, Maurizio Cimadamore wrote:
> Found it:
>
> https://code.zedzone.space/cvs?p=zcl;a=blob;f=src/notzed.zcl.demo/classes/au/notzed/zcl/test/TestMemoryLong.java;h=55e1c5051c7bfdfabfb62c3bbbe21d1e3f90e7aa;hb=refs/heads/foreign-abi 
>
>
> Interesting - you are testing multiple segments with the same 
> benchmark method. I'll try to replicate something similar with our JMH 
> infra.
>
> Maurizio
>
> On 28/01/2020 08:39, Maurizio Cimadamore wrote:
>> If you have the test somewhere I'd love to take a look and maybe port 
>> it on top of JMH. We are looking at these performance potholes now, 
>> so it is a great time to report such issues.