[vectorIntrinsics] RFR: Optimize mem barriers for ByteBuffer cases

Fri Jul 30 19:58:44 UTC 2021

On Tue, 27 Jul 2021 20:42:13 GMT, Radoslaw Smogura <github.com+7535718+rsmogura at openjdk.org> wrote:

> # Description
> This change tries to remove mem bars for byte buffer cases.
> 
> Previously, mem bars were inserted almost unconditionally whenever an attempt to access native memory was detected. This patch tries to follow inline_unsafe_access and inserts a barrier only when it can't be determined whether the access is on-heap or off-heap (type mismatch cases are not ported).
> 
> # Testing
> Memory tests should include rolling back the JDK changes and leaving only the hotspot changes, as the intrinsics should be well guarded.
> 
> # Notes
> Polluted cases to be addressed later
> 
> # Benchmarks
> 
> Benchmark                                (size)  Mode  Cnt    Score   Error  Units
> ByteBufferVectorAccess.arrays              1024  avgt   10   12.585 ± 0.409  ns/op
> ByteBufferVectorAccess.directBuffers       1024  avgt   10   19.962 ± 0.080  ns/op
> ByteBufferVectorAccess.heapBuffers         1024  avgt   10   15.878 ± 0.187  ns/op
> ByteBufferVectorAccess.pollutedBuffers2    1024  avgt   10  123.702 ± 0.723  ns/op
> ByteBufferVectorAccess.pollutedBuffers3    1024  avgt   10  223.928 ± 1.906  ns/op
> 
> Before
> 
> Benchmark                                (size)  Mode  Cnt    Score   Error  Units
> ByteBufferVectorAccess.arrays              1024  avgt   10   14.730 ± 0.061  ns/op
> ByteBufferVectorAccess.directBuffers       1024  avgt   10   77.707 ± 4.867  ns/op
> ByteBufferVectorAccess.heapBuffers         1024  avgt   10   76.530 ± 1.076  ns/op
> ByteBufferVectorAccess.pollutedBuffers2    1024  avgt   10  143.331 ± 1.096  ns/op
> ByteBufferVectorAccess.pollutedBuffers3    1024  avgt   10  286.645 ± 3.444  ns/op

To use decorators I would move C2AccessFence to the hpp file (right now it's private to the cpp file); this class is actually responsible for setting up barriers. I think this can be a good idea.

So, with polluted access there are actually two things that can happen, both resulting in the same performance degradation.

In the polluted case, if we have loadFromByteBufferScoped without the "if", the intrinsic cannot detect whether the access is native or not and starts inserting barriers.

With the "if" enabled, hotspot generates a less optimized graph. Mainly, the "if" causes a memory merge which is marked as bot; reads of the hb and limit fields take their memory from this merge and phi. In a perfect world they should not, as the merge does not affect this slice. As a consequence these fields are read on every loop pass and are not hoisted out of the loop.
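
To make the two shapes concrete, here is a minimal sketch in plain Java. It is only an illustration: it uses ByteBuffer.get instead of the vector intrinsic or loadFromByteBufferScoped, the class and method names are hypothetical, and it does not reproduce the barrier or memory-merge behaviour itself. It only shows what a call site fed with both heap and direct buffers looks like, with and without an explicit "if" on the buffer kind.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not the code from the PR.
public class PollutedAccessSketch {

    // No "if": one loop sees both heap and direct buffers, so the kind of
    // memory behind 'bb' is not known statically at the access.
    static long sumWithoutBranch(ByteBuffer bb) {
        long sum = 0;
        for (int i = 0; i < bb.limit(); i++) {
            sum += bb.get(i);
        }
        return sum;
    }

    // With "if": each arm only ever sees one kind of buffer, but the two
    // arms have to be merged again after the branch.
    static long sumWithBranch(ByteBuffer bb) {
        long sum = 0;
        if (bb.isDirect()) {
            // Off-heap arm.
            for (int i = 0; i < bb.limit(); i++) {
                sum += bb.get(i);
            }
        } else {
            // On-heap arm.
            for (int i = 0; i < bb.limit(); i++) {
                sum += bb.get(i);
            }
        }
        return sum;
    }

    // Fills the buffer with a simple repeating pattern using absolute puts.
    static ByteBuffer filled(ByteBuffer bb) {
        for (int i = 0; i < bb.capacity(); i++) {
            bb.put(i, (byte) (i & 0x7f));
        }
        return bb;
    }

    public static void main(String[] args) {
        ByteBuffer heap = filled(ByteBuffer.allocate(1024));
        ByteBuffer direct = filled(ByteBuffer.allocateDirect(1024));
        // Passing both kinds through the same methods is what makes the
        // call sites "polluted" in the sense used in this thread.
        System.out.println(sumWithoutBranch(heap) + " " + sumWithoutBranch(direct));
        System.out.println(sumWithBranch(heap) + " " + sumWithBranch(direct));
    }
}
```

Both methods compute the same result; the difference is only in the shape of the graph the JIT has to work with.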

I think I have already found a good (or at least working) idea for how to overcome this purely in Java.

Funny thing: I've found code in hotspot which theoretically can help, but it requires tuning [1].

[1] https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/cfgnode.cpp#L2212

-------------

PR: https://git.openjdk.java.net/panama-vector/pull/104