[vectorIntrinsics] RFR: Optimize mem barriers for ByteBuffer cases [v10]
Radoslaw Smogura
github.com+7535718+rsmogura at openjdk.java.net
Thu Aug 5 18:16:45 UTC 2021
On Wed, 4 Aug 2021 22:44:09 GMT, Radoslaw Smogura <github.com+7535718+rsmogura at openjdk.org> wrote:
>> # Description
>> This change tries to remove memory barriers for the ByteBuffer cases.
>>
>> Previously, memory barriers were inserted almost unconditionally whenever an attempt to access native memory was detected. This patch follows inline_unsafe_access and inserts a barrier only if it can't be determined whether the access is on-heap or off-heap (the type-mismatch cases are not ported).
>>
>> # Testing
>> Memory tests should include rolling back the JDK changes and leaving only the HotSpot ones, as the intrinsics should be well guarded.
>>
>> # Notes
>> Polluted cases are to be addressed later.
>>
>> # Benchmarks
>>
>> Benchmark                                (size) Mode Cnt   Score   Error Units
>> ByteBufferVectorAccess.arrays              1024 avgt  10  12.585 ± 0.409 ns/op
>> ByteBufferVectorAccess.directBuffers       1024 avgt  10  19.962 ± 0.080 ns/op
>> ByteBufferVectorAccess.heapBuffers         1024 avgt  10  15.878 ± 0.187 ns/op
>> ByteBufferVectorAccess.pollutedBuffers2    1024 avgt  10 123.702 ± 0.723 ns/op
>> ByteBufferVectorAccess.pollutedBuffers3    1024 avgt  10 223.928 ± 1.906 ns/op
>>
>> Before
>>
>> Benchmark                                (size) Mode Cnt   Score   Error Units
>> ByteBufferVectorAccess.arrays              1024 avgt  10  14.730 ± 0.061 ns/op
>> ByteBufferVectorAccess.directBuffers       1024 avgt  10  77.707 ± 4.867 ns/op
>> ByteBufferVectorAccess.heapBuffers         1024 avgt  10  76.530 ± 1.076 ns/op
>> ByteBufferVectorAccess.pollutedBuffers2    1024 avgt  10 143.331 ± 1.096 ns/op
>> ByteBufferVectorAccess.pollutedBuffers3    1024 avgt  10 286.645 ± 3.444 ns/op
>
> Radoslaw Smogura has updated the pull request incrementally with two additional commits since the last revision:
>
> - Revert: gitignore(s)
> - CR changes:
> * reformat checks
> * bring array mismatched access back
> > for a store we could assign result of StoreVector to two slices raw, and byte[] in a memory merge node,
>
> I don't see how it could work with the alias analysis (as it is implemented now).
> Every memory slice is "flattened" into a unique slice which doesn't alias with anything except the one represented with `TypePtr::BOTTOM`. What you suggest implies that some slices start to alias with raw memory. It will break the existing logic unless you find a smart way to fix it.
>
> > for a load, we could consume the whole memory as input, instead of a single slice.
>
> Still, you need to be very cautious about the alias index being assigned to the "wide" memory slice of mixed/mismatched access. Also, the logic which inserts anti-dependencies in the graph has to be taught about the aliasing slices.
>
> Overall, it looks error-prone, and it wouldn't necessarily lead to a simpler IR (and better generated code) compared to CPU memory barriers.
There was a comment in memnode.cpp: "A merge can take a "wide" memory state as one of its narrow inputs. This simply means that the merge observes out only the relevant parts of the wide input (...) (This is rare.)" So I thought we could mark a mixed store/load as delivering wide memory and assign it to only the two slices which can potentially be modified (we know that only one slice will be physically modified).
That was, more or less, one of the reasons I thought we could do this.
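To make the heap/off-heap distinction concrete, here is a small illustrative sketch (not JDK or HotSpot code; the class and method names are invented for this example). A heap ByteBuffer is backed by a byte[], while a direct one points at native memory, which is the property the intrinsic guard checks. It also shows a "polluted" call site in the spirit of the pollutedBuffers* benchmarks: both buffer kinds flow into one method, so a JIT compiling it cannot prove the access kind and must stay conservative.

```java
import java.nio.ByteBuffer;

public class BufferKindDemo {
    // Mimics the kind of check the intrinsic relies on: heap buffers
    // are backed by an accessible byte[], direct buffers are not.
    static boolean isHeapAccess(ByteBuffer bb) {
        return bb.hasArray();
    }

    // A "polluted" call site: both heap and direct buffers reach this
    // method, so type profiling cannot specialize the access and the
    // conservative path (with barriers) is kept.
    static int sumFirstInts(ByteBuffer bb, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += bb.getInt(i * Integer.BYTES);
        }
        return s;
    }

    public static void main(String[] args) {
        ByteBuffer heap = ByteBuffer.allocate(64);
        ByteBuffer direct = ByteBuffer.allocateDirect(64);
        heap.putInt(0, 7);
        direct.putInt(0, 7);

        System.out.println(isHeapAccess(heap));    // true
        System.out.println(isHeapAccess(direct));  // false
        // Same bytecode, two receiver kinds -> profile pollution:
        System.out.println(sumFirstInts(heap, 1) + sumFirstInts(direct, 1));
    }
}
```

This is only meant to show why the known-kind cases (arrays, directBuffers, heapBuffers) can drop the barrier while the polluted cases cannot.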
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/104
More information about the panama-dev mailing list