Exposing concrete types of segments and addresses
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Dec 22 22:44:40 UTC 2021
(re-adding panama-dev in case it might help somebody else as well)
Thanks for clarifying the benchmark.
I believe the issue here is the control-flow logic inside the method:
all the branches consume the same segment, which seems to prevent
escape analysis from scalar-replacing it. If you rewrite the method so
that each branch creates its own segment, as follows:
```
@Benchmark
public long read() {
    int length = this.length;
    long result = 0;
    long offset = this.offset;
    if ((length & Byte.BYTES) != 0) {
        var segment = MemorySegment.ofArray(this.array);
        result = Byte.toUnsignedLong(segment.get(ValueLayout.JAVA_BYTE, offset));
        offset += Byte.BYTES;
    }
    if ((length & Short.BYTES) != 0) {
        var segment = MemorySegment.ofArray(this.array);
        result = (result << Short.SIZE) |
                Short.toUnsignedLong(segment.get(ValueLayout.JAVA_SHORT, offset));
        offset += Short.BYTES;
    }
    if ((length & Integer.BYTES) != 0) {
        var segment = MemorySegment.ofArray(this.array);
        result = (result << Integer.SIZE) |
                Integer.toUnsignedLong(segment.get(ValueLayout.JAVA_INT, offset));
    }
    return result;
}
```
Escape analysis then works correctly, and no allocation occurs. I don't
think using a sharper type (which was your original suggestion) would
help in this case?
Maurizio
On 22/12/2021 02:24, Quân Anh Mai wrote:
> Hi, thank you very much for taking a look at this. The benchmark
> cycles the length variable so that the compiled code covers all
> branches.
>
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @State(Scope.Benchmark)
> @Fork(1)
> @Warmup(iterations = 8)
> @Measurement(iterations = 8)
> public class Sample {
>     int length;
>     int offset = 100;
>     byte[] array = new byte[200];
>
>     @Setup(Level.Iteration)
>     public void setup() {
>         length = (length + 1) % 8;
>     }
>
>     @Benchmark
>     public long read() {
>         int length = this.length;
>         var segment = MemorySegment.ofArray(this.array);
>         long result = 0;
>         long offset = this.offset;
>         if ((length & Byte.BYTES) != 0) {
>             result = Byte.toUnsignedLong(segment.get(ValueLayout.JAVA_BYTE, offset));
>             offset += Byte.BYTES;
>         }
>         if ((length & Short.BYTES) != 0) {
>             result = (result << Short.SIZE) |
>                     Short.toUnsignedLong(segment.get(ValueLayout.JAVA_SHORT, offset));
>             offset += Short.BYTES;
>         }
>         if ((length & Integer.BYTES) != 0) {
>             result = (result << Integer.SIZE) |
>                     Integer.toUnsignedLong(segment.get(ValueLayout.JAVA_INT, offset));
>         }
>         return result;
>     }
> }
>
> The results of the benchmark are as follows:
>
> Benchmark                                     Mode  Cnt      Score     Error   Units
> Sample.read                                   avgt    8     21.609 ±   5.685   ns/op
> Sample.read:·gc.alloc.rate                    avgt    8   1709.447 ± 451.144  MB/sec
> Sample.read:·gc.alloc.rate.norm               avgt    8     40.001 ±   0.001    B/op
> Sample.read:·gc.churn.G1_Eden_Space           avgt    8   1713.039 ± 460.429  MB/sec
> Sample.read:·gc.churn.G1_Eden_Space.norm      avgt    8     40.080 ±   1.549    B/op
> Sample.read:·gc.churn.G1_Survivor_Space       avgt    8      0.002 ±   0.001  MB/sec
> Sample.read:·gc.churn.G1_Survivor_Space.norm  avgt    8     ≈ 10⁻⁴              B/op
> Sample.read:·gc.count                         avgt    8    250.000            counts
> Sample.read:·gc.time                          avgt    8  24047.000                ms
>
> It seems that if I keep the length value unchanged, the allocation
> is indeed eliminated, as expected:
>
> Benchmark                        Mode  Cnt   Score   Error   Units
> Sample.read                      avgt    8   6.180 ± 0.887   ns/op
> Sample.read:·gc.alloc.rate       avgt    8  ≈ 10⁻⁴          MB/sec
> Sample.read:·gc.alloc.rate.norm  avgt    8  ≈ 10⁻⁷            B/op
> Sample.read:·gc.count            avgt    8     ≈ 0          counts
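>
> A fixed-value setup along these lines keeps the length unchanged (a
> sketch; the exact variant used for this run is not shown here):
>
> @Setup(Level.Iteration)
> public void setup() {
>     length = 7; // fixed, instead of cycling through 0..7
> }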
>
> Regards,
> Quan Anh
>
> On Wed, 22 Dec 2021 at 05:25, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> Hi,
> I tried your benchmark - I had to fill in some gaps - so I came up
> with the following enclosing class, which might or might not be
> similar to the one you are playing with:
>
> ```
> public class TestRead {
>
>     byte[] array = new byte[1024];
>     int length = 7; // worst case?
>     int offset = 16;
>
>     @Benchmark
>     public long read() {
>         ...
>     }
> }
> ```
>
> I then ran the benchmark with "-prof gc"; the allocation rate seems
> very low for the warmup iterations and the first few iterations
> (0.270 MB/sec), then it drops to zero on subsequent iterations. It
> seems to me that (with all the usual caveats of this being only a
> synthetic benchmark) this one is working relatively well?
>
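> For reference, with the standalone JMH runner this corresponds to
> an invocation along these lines (the jar name is build-specific):
>
> java -jar target/benchmarks.jar TestRead -prof gc
>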
> Here are the results I got:
>
> ```
> Benchmark                          Mode  Cnt  Score   Error   Units
> TestRead.read                      avgt   30  4.687 ± 0.120   ns/op
> TestRead.read:·gc.alloc.rate       avgt   30  0.110 ± 0.091  MB/sec
> TestRead.read:·gc.alloc.rate.norm  avgt   30  0.001 ± 0.001    B/op
> TestRead.read:·gc.count            avgt   30    ≈ 0          counts
> ```
>
> Here the GC seems not to run at all, and the overall allocation
> rate is very low (probably the result of combining the low
> allocation rate of the first few iterations with the non-existent
> allocation rate of the later ones).
>
> I'm probably not replicating your benchmark correctly (I tried
> different values of "length" to make the code take different
> branches, to no avail) - but if I am, what I see doesn't seem to
> suggest that GC is acting as a bottleneck here?
>
> Cheers
> Maurizio
>
>
>
> On 21/12/2021 14:51, Quân Anh Mai wrote:
>> Thank you very much for the detailed explanation; I agree that we
>> need to be patient, as adding more types to the API is easier than
>> removing them. I can imagine that later on we could expose only
>> HeapMemorySegment<T>, NativeMemorySegment and MappedMemorySegment,
>> if we are forced to do so.
>>
>> Regarding a non-optimal circumstance, I discovered an interesting
>> case where I want to read a long value from a byte array, given
>> that the number of bytes read might be less than 8. The benchmark
>> is as follows:
>>
>> @Benchmark
>> public long read() {
>>     int length = this.length;
>>     var segment = MemorySegment.ofArray(this.array);
>>     long result = 0;
>>     long offset = this.offset;
>>     if ((length & Byte.BYTES) != 0) {
>>         result = Byte.toUnsignedLong(segment.get(ValueLayout.JAVA_BYTE, offset));
>>         offset += Byte.BYTES;
>>     }
>>     if ((length & Short.BYTES) != 0) {
>>         result = (result << Short.SIZE) |
>>                 Short.toUnsignedLong(segment.get(ValueLayout.JAVA_SHORT, offset));
>>         offset += Short.BYTES;
>>     }
>>     if ((length & Integer.BYTES) != 0) {
>>         result = (result << Integer.SIZE) |
>>                 Integer.toUnsignedLong(segment.get(ValueLayout.JAVA_INT, offset));
>>     }
>>     return result;
>> }
>>
>> Running with a fairly recent revision of openjdk/jdk (the
>> difference is 12 commits as of right now, which means the running
>> JVM already contains the fix for the bug you mentioned), the
>> generated assembly seems suboptimal, with the segment failing to
>> be scalarized.
>>
>> Regards,
>> Quan Anh
>>
>> On Tue, 21 Dec 2021 at 05:29, Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> Hi,
>> thanks for your email. This is a really tricky area, where no
>> optimal solution exists yet.
>>
>> First, we have recently spotted an issue with escape analysis not
>> working correctly with memory segments - for this I filed the
>> following issue:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8278429
>>
>> which has been closed as a duplicate of another VM bug that is
>> being worked on. I believe that fix should generally improve all
>> scenarios where there is a bottleneck due to failure of
>> scalarization when creating new segments (e.g. slicing).
>>
>> That said, this does not address your fundamental point that, at
>> the end of the day, some of these optimizations depend on the
>> ability of C2 to inline through code (but this is also true for
>> the ByteBuffer API).
>>
>> The ultimate solution would be IMHO to make memory segments _less_
>> polymorphic, by having a single implementation class which then
>> delegates its memory access behavior to a secondary abstraction
>> (which could be a constant, based on the access type: on-heap, or
>> off-heap).
>>
>> If we did that, a memory segment would become a dumb wrapper around
>> a base object, a length and some (constant) access object helper.
>>
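>> As a rough sketch (names are hypothetical, and this is not the
>> actual implementation), such a "monomorphic" segment could look
>> like:
>>
>> ```
>> // One concrete segment class; the on-/off-heap behaviour lives in
>> // a constant helper object rather than in the class hierarchy.
>> final class MonoSegment {
>>     final Object base;       // null for off-heap segments
>>     final long offset;       // raw address, or array base offset
>>     final long length;
>>     final Accessor accessor; // constant, chosen by access type
>>
>>     MonoSegment(Object base, long offset, long length, Accessor accessor) {
>>         this.base = base;
>>         this.offset = offset;
>>         this.length = length;
>>         this.accessor = accessor;
>>     }
>>
>>     interface Accessor {
>>         byte getByte(Object base, long offset);
>>     }
>> }
>> ```
>>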
>> Unfortunately this solution (which we have tried) doesn't work -
>> because Unsafe memory access needs to know whether access is going
>> to be on- or off-heap (in order to remove important memory
>> barriers). Currently this is done with the help of type profiling:
>> if we are accessing memory on a type that C2 can prove to be
>> "NativeMemorySegmentImpl", then C2 also knows that access is going
>> to be off-heap - and unsafe access is fast. To have profiling work
>> correctly we need one concrete segment type for each possible
>> access type (native, mapped, and one for each primitive on-heap
>> array). But if there's only one concrete type, there's no type
>> profiling to go on, so we gain monomorphism, but we lose (very
>> badly) when it comes to profile pollution exposure. To fix this,
>> we need better ways to do type profiling (based not only on
>> receiver/parameter types, but maybe also on the type of some
>> fields in an instance).
>>
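>> As a rough illustration of what type profiling buys (pseudocode,
>> not actual VM internals):
>>
>> ```
>> // With one concrete type per access kind, a profiled call site
>> // can speculate on the exact class:
>> if (segment.getClass() == NativeMemorySegmentImpl.class) {
>>     // base is known to be null here: the unsafe load compiles to
>>     // a raw off-heap access, with no GC barriers
>> } else {
>>     // uncommon trap / slow path
>> }
>> ```
>>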
>> Now, in the current implementation we can hide the polymorphism,
>> pretty much like ByteBuffer does, under a common interface.
>> Exposing concrete types as you suggest is going to be painful -
>> users would see 9 more segment types (7 primitive arrays, plus
>> mapped and native), which would increase the size of the API quite
>> considerably. Maybe some intermediate point might also be useful
>> to consider (e.g. perhaps only two types - for native segments and
>> heap segments - without differentiating between mapped/native, or
>> between byte[] and long[], in the public API). But we need to
>> consider any such move very carefully: while we can add these
>> types very easily in the future, if that proves to be the only
>> possible path (even after Valhalla) to using memory segments
>> sanely, the reverse is not true: if we add these new types now,
>> and later on we discover that they are superseded by some new VM
>> optimization, or by better support thanks to Valhalla, we'd be
>> stuck with these types for a long time.
>>
>> I think at this point in time we'd like to know where the
>> performance potholes are - so if you happen to have a benchmark
>> which shows the problem you discussed, we'd be very happy to take
>> a look. Our experience so far seems to suggest that performance is
>> acceptable - even in cases where segments are created in very hot
>> paths (we do have a spliterator test which inundates the system
>> with slices - and that doesn't seem to perform too badly). At the
>> same time, I can believe you when you say that some of the
>> optimizations we might rely upon are fragile (I've been there when
>> using the API on my own, so the mileage of certain idioms can
>> vary).
>>
>> Unfortunately this is a bigger problem IMHO than just
>> MemorySegments: currently, writing immutable APIs in Java can lead
>> to spotty performance. The hope is that Valhalla will give us
>> tools to help us manage that kind of complexity - but even then,
>> some of the optimizations (e.g. scalarization) might be gated by
>> excessive polymorphism and/or lack of inlining. If we can improve
>> the VM enough to do the type profiling we need to keep unsafe
>> access sharp even in the face of a "monomorphic" implementation,
>> then I believe the current API could take advantage of Valhalla in
>> a more straightforward fashion (and we could, in the future, add
>> Valhalla optimizations to special-case the treatment of sealed
>> interfaces whose only implementation is a primitive class).
>>
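>> As a sketch of that shape (using a record as a stand-in for a
>> future primitive class; the names are made up):
>>
>> ```
>> // A sealed interface with exactly one implementation: uses of the
>> // interface are effectively monomorphic, which the VM can exploit.
>> sealed interface Address permits AddressImpl {}
>> record AddressImpl(long rawValue) implements Address {}
>> ```
>>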
>> [Btw, this discussion is really about MemorySegment - for
>> MemoryAddress, in my own experiments I could already see Valhalla
>> making quick work of all the address instantiations, as
>> MemoryAddressImpl is the only implementation of MemoryAddress].
>>
>> Maurizio
>>
>>
>> On 20/12/2021 07:21, Quân Anh Mai wrote:
>> > Hi,
>> >
>> > Currently, we can only access MemorySegments and MemoryAddresses
>> > through the respective interfaces. While this provides a nice
>> > interface for all kinds of memory segments, the lack of ability
>> > to use the concrete types leads to a lot of performance caveats.
>> >
>> > Firstly, polymorphism disables scalarization. While a
>> > non-escaped object can be scalarized in most cases, there are
>> > still circumstances where scalar replacement fails (e.g. when we
>> > continuously slice a segment in a loop - see the sketch after
>> > this paragraph). Furthermore, this makes us dependent on the
>> > inlining ability of the compiler, which is unpredictable and
>> > limits the use of segments and addresses where performance
>> > matters. On the other hand, scalarization of polymorphic types
>> > in fields and in the calling convention seems to be really
>> > complicated. With primitive classes, we could make the
>> > performance of the foreign API much more predictable, with the
>> > elimination of allocations as well as pointer chasing where we
>> > can and want to limit the kind of segment we operate on.
>> >
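>> > A sketch of the slicing-in-a-loop pattern mentioned above
>> > (illustrative only; each iteration creates a fresh segment,
>> > which scalar replacement may fail to eliminate):
>> >
>> > var segment = MemorySegment.ofArray(new byte[64]);
>> > long sum = 0;
>> > while (segment.byteSize() >= Integer.BYTES) {
>> >     sum += segment.get(ValueLayout.JAVA_INT, 0);
>> >     segment = segment.asSlice(Integer.BYTES); // new slice each time
>> > }
>> >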
>> > The above caveats lead to a possible usage of the foreign API
>> > where naked addresses are passed around as long values, and
>> > segments are only constructed where needed. This approach,
>> > besides being an ugly hack, is still not ideal because multiple
>> > methods may fail to be inlined.
>> >
>> > Secondly, polymorphism limits specialisation. With JEP 218, we
>> > may have multiple specialisations of the same methods operating
>> > on different kinds of segments. While it is still possible, to
>> > some extent, to have specialisation with the polymorphic
>> > MemorySegment type, it would likely be a fragile optimisation
>> > that relies on inlining and a lot of type checks.
>> >
>> > Furthermore, while they share common aspects, the different
>> > kinds of MemorySegment expose different behaviours: e.g.
>> > HeapMemorySegment is not Addressable, and MappedMemorySegment
>> > has various additional specific methods. While this is not an
>> > argument about the design of the foreign API per se, it is a
>> > small bonus point on top of those above.
>> >
>> > Overall, the current state of the foreign API seems to put us in
>> > a position that relies too much on the compiler to get the
>> > desired performance. Exposing the concrete types would enable us
>> > to write more predictable code where that is needed, and
>> > flexible code (i.e. using the polymorphic MemorySegment,
>> > MemoryAddress, etc.) where that is more desirable.
>> >
>> > My apologies if this question has been addressed before. Thank
>> > you very much.
>> > Quan Anh
>>