AW: Using MemoryAccess with structured MemoryLayout

Fri Feb 26 11:21:41 UTC 2021

On Thu, 2021-02-25 at 17:51 -0600, Ty Young wrote:
> What specifically makes a higher-level Panama API slow beside
> handles 
> not being static-final? Is something like this:

That would be the main thing, and unfortunately it is a pretty big
thing. But please, don't take my word for it - luckily there's a way to
measure this kind of claim; this is an extract from a benchmark I
tweaked on the fly to prove the point:

```
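    // static final: the JIT can treat the handle as a constant
    static final VarHandle VH_int = MemoryLayout.ofSequence(JAVA_INT).varHandle(int.class, sequenceElement());
    // plain instance field: the handle has to be loaded on each access
    VarHandle VH_int_non_final = MemoryLayout.ofSequence(JAVA_INT).varHandle(int.class, sequenceElement());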

    @Benchmark
    public int segment_loop() {
        int sum = 0;
        for (int i = 0; i < ELEM_SIZE; i++) {
            sum += (int) VH_int.get(segment, (long) i);
        }
        return sum;
    }

    @Benchmark
    public int segment_loop_non_final() {
        int sum = 0;
        for (int i = 0; i < ELEM_SIZE; i++) {
            sum += (int) VH_int_non_final.get(segment, (long) i);
        }
        return sum;
    }
```

And here are the results:

Benchmark                                   Mode  Cnt   Score   Error  Units
LoopOverNonConstant.segment_loop            avgt   30   0.230 ± 0.001  ms/op
LoopOverNonConstant.segment_loop_non_final  avgt   30  23.689 ± 0.282  ms/op

In other words, unless you do some heroics (like Remi has shown - by
using mutable callsites, relying on constantness of parameters, as well
as using records to get to "trusted instance finals" - all of which is
great, but probably also fragile), the non-static-final version is 100x
(!!) slower (I realize I was too conservative yesterday when I said
10x).

So, that's the kind of cost we're staring at. If it was 1.5x of course
we'd have more options - but the difference is rather massive. Also
let's not forget that one of the goals here is for the memory access
API to be a valid replacement for Unsafe::getInt and friends - which
are low-level capabilities whose performance is similar to the var
handle "final static" version above; we would fail in providing an
alternative if what we produced was 100x slower than Unsafe!

>  
> 
> NativeIntegerArray array = new NativeIntegerArray(5);
> 
>          array.setPrimitive(0, 1);
>          array.setPrimitive(1, 2);
>          array.setPrimitive(2, 3);
>          array.setPrimitive(3, 4);
>          array.setPrimitive(4, 5);
> 
>          for(int i = 0; i < array.getSize(); i++)
>              System.out.println(array.getObject(i));
> 
> 
> with setPrimitive defined as:
> 
> 
> public void setPrimitive(long index, int value)
> {
>          super.handle.set(this.getSegment(), index * 
> this.getElementSize(), value);
> }
> 
> 
> and getObject as:
> 
> 
> public Integer getObject(long index)
> {
>          return (Integer)super.handle.get(this.getSegment(), index * 
> this.getElementSize());
> }
> 
> 
> (getSegment() and getElementSize() are just simple returns)
> 
> 
> Not as fast as MemoryAccess, assuming the handle is static-
> final(which 
> it is)? 

The problem is that the handle is constant "at the beginning" but then
you store it into an instance field - and when you call e.g.
setPrimitive you have to load the handle from that instance
(non-constant) field. This load is not trusted, hence the access will
be slow.
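
As a rough sketch of the difference, the class below keeps the same
handle in a static final field and in an instance final field. It is
not the NativeIntegerArray from the quoted mail: IntArrayView and its
method names are made up for illustration, and a plain int[] handle
from MethodHandles.arrayElementVarHandle stands in for a memory
segment handle so that the snippet only needs java.lang.invoke. Both
loops return the same sum; only the path the JIT sees to the handle
differs.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical wrapper, for illustration only.
final class IntArrayView {
    // static final: the JIT can treat the handle as a constant.
    private static final VarHandle ELEMS =
            MethodHandles.arrayElementVarHandle(int[].class);

    // Instance final holding the very same handle: the field load is
    // not trusted as constant, as described above.
    private final VarHandle elemsField = ELEMS;

    private final int[] data;

    IntArrayView(int size) {
        data = new int[size];
    }

    void set(int index, int value) {
        ELEMS.set(data, index, value);
    }

    // Access goes through the static final handle.
    int sumViaStaticHandle() {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += (int) ELEMS.get(data, i);
        }
        return sum;
    }

    // Access goes through the handle loaded from the instance field.
    int sumViaInstanceHandle() {
        int sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += (int) elemsField.get(data, i);
        }
        return sum;
    }
}
```

How large the gap is on a given VM is something to measure (e.g. with
JMH) rather than assume.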

One workaround is to do what Remi did - declare your class as a record
- then final fields become trusted, so if the handle you store is
constant, constantness will be propagated. This might be a fine
approach depending on the circumstances - I'd personally wait a bit
more until the VM is ready to deal with and optimize "instance final"
fields in a more general way than what it does now:

https://bugs.openjdk.java.net/browse/JDK-8233873
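
A minimal sketch of the record approach, assuming a JDK where records
are available (16 or later). The record name IntElements and its
members are hypothetical, and an int[] element handle is used in place
of a segment handle to keep the snippet self-contained. Whether the
JIT actually folds the handle depends on the VM and on the record
instance itself being a constant at the call site, so treat this as an
illustration of the shape, not a guarantee of speed.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Hypothetical record, for illustration only. Record components are
// final fields that the VM may treat as trusted.
record IntElements(VarHandle handle, int[] data) {
    // The handle starts out as a static final constant ...
    static final VarHandle INT_ELEMS =
            MethodHandles.arrayElementVarHandle(int[].class);

    // ... and is then stored into the record component.
    static IntElements ofSize(int size) {
        return new IntElements(INT_ELEMS, new int[size]);
    }

    void set(int index, int value) {
        handle.set(data, index, value);
    }

    int get(int index) {
        return (int) handle.get(data, index);
    }
}
```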

> Or are you saying that a super concise Pointer API specifically 
> could never be fast? You mention casting, can you go into more detail
> on 
> what you mean?

First - what I'm saying is that a Pointer API (no matter how concise)
is not a "primitive" concept. It's darn useful for interacting with C
code, but it's annoying to have to deal with when you don't care about
C code.

For instance, in the previous Pointer API, it was not uncommon to
create a pointer to a blob of bytes (e.g. think of it as a char*).

Then you had to offset the pointer 42 bytes, and read a long. How do
you do that?

pointer.offset(42).cast(LayoutType.LONG).get();

(the old API wasn't exactly like that, but you get the spirit).

When working with heterogeneous, buffer-like contents (think of it as
trying to use the Pointer API as if it was a ByteBuffer), this idiom
was popping up all over the place, and made it tedious to code with the
API (in fact, nobody really considered using the Pointer API as a
replacement for ByteBuffer in Netty/Lucene and others).
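
For contrast, here is a sketch of the buffer-like style, where reading
a long 42 bytes into a blob is a single offset-addressed access with no
cast step. It uses the standard MethodHandles.byteBufferViewVarHandle
over a ByteBuffer instead of the incubating segment API; the class name
BlobAccess and the little-endian byte order are assumptions made for
the example.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, for illustration only.
final class BlobAccess {
    // Views the bytes of a ByteBuffer as longs; the index coordinate
    // is a byte offset, so no pointer cast is involved.
    static final VarHandle LONG_AT =
            MethodHandles.byteBufferViewVarHandle(long[].class,
                                                  ByteOrder.LITTLE_ENDIAN);

    static long readLong(ByteBuffer blob, int byteOffset) {
        // Plain get is permitted at unaligned offsets such as 42.
        return (long) LONG_AT.get(blob, byteOffset);
    }

    static void writeLong(ByteBuffer blob, int byteOffset, long value) {
        LONG_AT.set(blob, byteOffset, value);
    }
}
```

Usage would look like BlobAccess.readLong(blob, 42), with the handle
kept in a static final field for the reasons discussed earlier.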

That's when we realized that, in order to provide an API which worked
at the right level of abstraction, we had to go lower: C pointers were
an attractive starting point, but, sadly, not the memory access
primitive we were looking for. For a framework as ubiquitous as the
JDK, "return on investment" is an important metric; a Pointer-like API
seemed like a dead-end in that respect: good when working with C code,
completely inadequate when moving to other off-heap use cases.

Structured memory access with VarHandles and segments, whether people
like playing with var handles or not, is efficient and general enough
- that means that it opens up the possibility to (re)build a Pointer
API on top of that. And, when the time is right (better VM
optimizations, Valhalla primitive classes) I believe we can and we will
have such a class.

But let's not forget that, by going lower level, not only is what we
have way more efficient than what we had before, but we also opened up
Panama to a relatively sizeable audience (off-heap, ND-array, ...) who
didn't necessarily know what to make of Panama.

Cheers
Maurizio

> 
> 
> On 2/25/21 12:12 PM, Maurizio Cimadamore wrote:
> > Second, if a notion of layout is always associated with a segment,
> > you
> > end up in a place where, in order to slice a segment, you probably
> > have
> > to follow that operation with some kind of "cast" (e.g. where you
> > set
> > the layout of the slice to something else). We've been there with a
> > past incarnation of the Panama API, and, while an API like the one
> > you
> > describe is probably more suited to closely model a C pointer type,
> > that API is not very "primitive" - meaning that it is quite useless
> > if
> > you start using a memory segment in a more buffer-like way.
> > 
> > Note that not _all_ the users of the Memory Access API are
> > interested
> > in native interop - many just want to be able to allocate slabs of
> > native memory, and free deterministically. So, the more baggage we
> > add,
> > the more those non-linker use cases become bloated with unnecessary
> > overhead.
> > 
> > ...
> > 
> > The fine line we're walking in this project is to expose the tools
> > and
> > the knobs which allow clients to perform memory access/foreign
> > function
> > access in the fastest possible way we know of/is possible within
> > the
> > JVM. To do that, sometimes (not always) we have to "look the other
> > way"
> > when it comes to usability - simply because it would be impossible
> > to
> > have an API that is both 100% efficient and 100% usable.
> > 
> > 
> > Even at the jextract level, we are aware that some people would
> > expect
> > an API that is closer to the C world (e.g. a `Pointer` type? Struct
> > wrappers?) - but again here our approach is to enable people to
> > write
> > code which targets the library they wanna use quickly (e.g. way
> > faster
> > than using JNI), but w/o introducing unnecessary translation steps
> > in
> > the middle - which would make the bindings too slow for some
> > advanced
> > use cases.
> > 
> > I apologize for the (too) big reply - I hope you find it helpful to
> > understand the "why not" part of your earlier question.
> > 
> > Cheers
> > Maurizio
> 
>