Exposing concrete types of segments and addresses
Maurizio Cimadamore
maurizio.cimadamore at oracle.com
Wed Dec 22 22:44:40 UTC 2021
(re-adding panama-dev in case it might help somebody else as well)
Thanks for clarifying the benchmark.
I believe the issue here is the control-flow logic inside the method:
all the branches consume the same segment, which seems to prevent
escape analysis from scalar-replacing it. If you rewrite the method so
that each branch creates its own segment, as follows:
```
@Benchmark
public long read() {
    int length = this.length;
    long result = 0;
    long offset = this.offset;
    if ((length & Byte.BYTES) != 0) {
        var segment = MemorySegment.ofArray(this.array);
        result = Byte.toUnsignedLong(segment.get(ValueLayout.JAVA_BYTE, offset));
        offset += Byte.BYTES;
    }
    if ((length & Short.BYTES) != 0) {
        var segment = MemorySegment.ofArray(this.array);
        result = (result << Short.SIZE) |
                Short.toUnsignedLong(segment.get(ValueLayout.JAVA_SHORT, offset));
        offset += Short.BYTES;
    }
    if ((length & Integer.BYTES) != 0) {
        var segment = MemorySegment.ofArray(this.array);
        result = (result << Integer.SIZE) |
                Integer.toUnsignedLong(segment.get(ValueLayout.JAVA_INT, offset));
    }
    return result;
}
```
Escape analysis then works correctly, and no allocation occurs. I don't
think using a sharper type (which was your original suggestion) would
help in this case?
Maurizio
On 22/12/2021 02:24, Quân Anh Mai wrote:
> Hi, thank you very much for taking a look at this. The benchmark
> cycles the length variable so that the compiled code covers all
> branches.
>
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @State(Scope.Benchmark)
> @Fork(1)
> @Warmup(iterations = 8)
> @Measurement(iterations = 8)
> public class Sample {
>     int length;
>     int offset = 100;
>     byte[] array = new byte[200];
>
>     @Setup(Level.Iteration)
>     public void setup() {
>         length = (length + 1) % 8;
>     }
>
>     @Benchmark
>     public long read() {
>         int length = this.length;
>         var segment = MemorySegment.ofArray(this.array);
>         long result = 0;
>         long offset = this.offset;
>         if ((length & Byte.BYTES) != 0) {
>             result = Byte.toUnsignedLong(segment.get(ValueLayout.JAVA_BYTE, offset));
>             offset += Byte.BYTES;
>         }
>         if ((length & Short.BYTES) != 0) {
>             result = (result << Short.SIZE) |
>                     Short.toUnsignedLong(segment.get(ValueLayout.JAVA_SHORT, offset));
>             offset += Short.BYTES;
>         }
>         if ((length & Integer.BYTES) != 0) {
>             result = (result << Integer.SIZE) |
>                     Integer.toUnsignedLong(segment.get(ValueLayout.JAVA_INT, offset));
>         }
>         return result;
>     }
> }
>
> The results of the benchmark are as follows:
>
> Benchmark                                     Mode  Cnt      Score     Error   Units
> Sample.read                                   avgt    8     21.609 ±   5.685   ns/op
> Sample.read:·gc.alloc.rate                    avgt    8   1709.447 ± 451.144  MB/sec
> Sample.read:·gc.alloc.rate.norm               avgt    8     40.001 ±   0.001    B/op
> Sample.read:·gc.churn.G1_Eden_Space           avgt    8   1713.039 ± 460.429  MB/sec
> Sample.read:·gc.churn.G1_Eden_Space.norm      avgt    8     40.080 ±   1.549    B/op
> Sample.read:·gc.churn.G1_Survivor_Space       avgt    8      0.002 ±   0.001  MB/sec
> Sample.read:·gc.churn.G1_Survivor_Space.norm  avgt    8     ≈ 10⁻⁴              B/op
> Sample.read:·gc.count                         avgt    8    250.000            counts
> Sample.read:·gc.time                          avgt    8  24047.000                ms
>
> It seems that if I keep the length value unchanged, the allocation
> is indeed eliminated, as expected:
>
> Benchmark                        Mode  Cnt   Score   Error   Units
> Sample.read                      avgt    8   6.180 ± 0.887   ns/op
> Sample.read:·gc.alloc.rate       avgt    8  ≈ 10⁻⁴          MB/sec
> Sample.read:·gc.alloc.rate.norm  avgt    8  ≈ 10⁻⁷            B/op
> Sample.read:·gc.count            avgt    8     ≈ 0          counts
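>
> A fixed-value setup along these lines keeps the length unchanged (a
> sketch; the exact variant used for this run is not shown here):
>
> @Setup(Level.Iteration)
> public void setup() {
>     length = 7; // fixed, instead of cycling through 0..7
> }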
>
> Regards,
> Quan Anh
>
> On Wed, 22 Dec 2021 at 05:25, Maurizio Cimadamore
> <maurizio.cimadamore at oracle.com> wrote:
>
> Hi,
> I tried your benchmark - I had to fill in some gaps - so I came up
> with the following enclosing class, which might or might not be
> similar to the one you are playing with:
>
> ```
> public class TestRead {
>
>     byte[] array = new byte[1024];
>     int length = 7; // worst case?
>     int offset = 16;
>
>     @Benchmark
>     public long read() {
>         ...
>     }
> }
> ```
>
> I then ran the benchmark with "-prof gc"; the allocation rate seems
> very low for the warmup iterations and the first few iterations
> (0.270 MB/sec), then it drops to zero on subsequent iterations. It
> seems to me that (with all the usual caveats of this being only a
> synthetic benchmark) this one is working relatively well?
>
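> For reference, with the standalone JMH runner this corresponds to
> an invocation along these lines (the jar name is build-specific):
>
> java -jar target/benchmarks.jar TestRead -prof gc
>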
> Here are the results I got:
>
> ```
> Benchmark                          Mode  Cnt  Score   Error   Units
> TestRead.read                      avgt   30  4.687 ± 0.120   ns/op
> TestRead.read:·gc.alloc.rate       avgt   30  0.110 ± 0.091  MB/sec
> TestRead.read:·gc.alloc.rate.norm  avgt   30  0.001 ± 0.001    B/op
> TestRead.read:·gc.count            avgt   30    ≈ 0          counts
> ```
>
> Here the GC seems not to run at all, and the overall allocation
> rate is very low (probably the result of combining the low
> allocation rate of the first few iterations with the non-existent
> allocation rate of the later ones).
>
> I'm probably not replicating your benchmark correctly (I tried
> different values of "length" to make the code take different
> branches, to no avail) - but if I am, what I see doesn't seem to
> suggest that GC is acting as a bottleneck here?
>
> Cheers
> Maurizio
>
>
>
> On 21/12/2021 14:51, Quân Anh Mai wrote:
>> Thank you very much for the detailed explanation; I agree that we
>> need to be patient, as adding more types to the API is easier than
>> removing them. I can imagine that later on we could expose only
>> HeapMemorySegment<T>, NativeMemorySegment and MappedMemorySegment,
>> if we are forced to do so.
>>
>> Regarding a non-optimal circumstance, I discovered an interesting
>> case where I want to read a long value from a byte array, given
>> that the number of bytes read might be less than 8. The benchmark
>> is as follows:
>>
>> @Benchmark
>> public long read() {
>>     int length = this.length;
>>     var segment = MemorySegment.ofArray(this.array);
>>     long result = 0;
>>     long offset = this.offset;
>>     if ((length & Byte.BYTES) != 0) {
>>         result = Byte.toUnsignedLong(segment.get(ValueLayout.JAVA_BYTE, offset));
>>         offset += Byte.BYTES;
>>     }
>>     if ((length & Short.BYTES) != 0) {
>>         result = (result << Short.SIZE) |
>>                 Short.toUnsignedLong(segment.get(ValueLayout.JAVA_SHORT, offset));
>>         offset += Short.BYTES;
>>     }
>>     if ((length & Integer.BYTES) != 0) {
>>         result = (result << Integer.SIZE) |
>>                 Integer.toUnsignedLong(segment.get(ValueLayout.JAVA_INT, offset));
>>     }
>>     return result;
>> }
>>
>> Running with a fairly recent revision of openjdk/jdk (the
>> difference is 12 commits as of right now, which means the running
>> JVM already contains the fix for the bug you mentioned), the
>> generated assembly seems suboptimal, with the segment failing to
>> be scalarized.
>>
>> Regards,
>> Quan Anh
>>
>> On Tue, 21 Dec 2021 at 05:29, Maurizio Cimadamore
>> <maurizio.cimadamore at oracle.com> wrote:
>>
>> Hi,
>> thanks for your email. This is a really tricky area, where no
>> optimal solution exists yet.
>>
>> First, we have recently spotted an issue with escape analysis not
>> working correctly with memory segments - for this I filed the
>> following issue:
>>
>> https://bugs.openjdk.java.net/browse/JDK-8278429
>>
>> which has been closed as a duplicate of another VM bug that is
>> being worked on. I believe that fix should generally improve all
>> scenarios where there is a bottleneck due to failure of
>> scalarization when creating new segments (e.g. slicing).
>>
>> That said, this does not address your fundamental point that, at
>> the end of the day, some of these optimizations depend on the
>> ability of C2 to inline through code (but this is also true for
>> the ByteBuffer API).
>>
>> The ultimate solution would be IMHO to make memory segments _less_
>> polymorphic, by having a single implementation class which then
>> delegates its memory access behavior to a secondary abstraction
>> (which could be a constant, based on the access type: on-heap, or
>> off-heap).
>>
>> If we did that, a memory segment would become a dumb wrapper around
>> a base object, a length and some (constant) access object helper.
>>
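>> As a rough sketch (names are hypothetical, and this is not the
>> actual implementation), such a "monomorphic" segment could look
>> like:
>>
>> ```
>> // One concrete segment class; the on-/off-heap behaviour lives in
>> // a constant helper object rather than in the class hierarchy.
>> final class MonoSegment {
>>     final Object base;       // null for off-heap segments
>>     final long offset;       // raw address, or array base offset
>>     final long length;
>>     final Accessor accessor; // constant, chosen by access type
>>
>>     MonoSegment(Object base, long offset, long length, Accessor accessor) {
>>         this.base = base;
>>         this.offset = offset;
>>         this.length = length;
>>         this.accessor = accessor;
>>     }
>>
>>     interface Accessor {
>>         byte getByte(Object base, long offset);
>>     }
>> }
>> ```
>>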
>> Unfortunately this solution (which we have tried) doesn't work -
>> because Unsafe memory access needs to know whether access is going
>> to be on- or off-heap (in order to remove important memory
>> barriers). Currently this is done with the help of type profiling:
>> if we are accessing memory on a type that C2 can prove to be
>> "NativeMemorySegmentImpl", then C2 also knows that access is going
>> to be off-heap - and unsafe access is fast. To have profiling work
>> correctly we need one concrete segment type for each possible
>> access type (native, mapped, and one for each primitive on-heap
>> array). But if there's only one concrete type, there's no type
>> profiling to go on, so we gain monomorphism, but we lose (very
>> badly) when it comes to profile pollution exposure. To fix this,
>> we need better ways to do type profiling (based not only on
>> receiver/parameter types, but maybe also on the type of some
>> fields in an instance).
>>
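>> As a rough illustration of what type profiling buys (pseudocode,
>> not actual VM internals):
>>
>> ```
>> // With one concrete type per access kind, a profiled call site
>> // can speculate on the exact class:
>> if (segment.getClass() == NativeMemorySegmentImpl.class) {
>>     // base is known to be null here: the unsafe load compiles to
>>     // a raw off-heap access, with no GC barriers
>> } else {
>>     // uncommon trap / slow path
>> }
>> ```
>>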
>> Now, in the current implementation we can hide the polymorphism,
>> pretty much like ByteBuffer does, under a common interface.
>> Exposing concrete types as you suggest is going to be painful -
>> users would see 9 more segment types (7 primitive arrays, plus
>> mapped and native), which would increase the size of the API quite
>> considerably. Maybe some intermediate point might also be useful
>> to consider (e.g. perhaps only two types - for native segments and
>> heap segments - without differentiating between mapped/native, or
>> between byte[] and long[], in the public API). But we need to
>> consider any such move very carefully: while we can add these
>> types very easily in the future, if that proves to be the only
>> possible path (even after Valhalla) to using memory segments
>> sanely, the reverse is not true: if we add these new types now,
>> and later on we discover that they are superseded by some new VM
>> optimization, or by better support thanks to Valhalla, we'd be
>> stuck with these types for a long time.
>>
>> I think at this point in time we'd like to know where the
>> performance potholes are - so if you happen to have a benchmark
>> which shows the problem you discussed, we'd be very happy to take
>> a look. Our experience so far seems to suggest that performance is
>> acceptable - even in cases where segments are created in very hot
>> paths (we do have a spliterator test which inundates the system
>> with slices - and that doesn't seem to perform too badly). At the
>> same time, I can believe you when you say that some of the
>> optimizations we might rely upon are fragile (I've been there when
>> using the API on my own, so the mileage of certain idioms can
>> vary).
>>
>> Unfortunately this is a bigger problem IMHO than just
>> MemorySegments: currently, writing immutable APIs in Java can lead
>> to spotty performance. The hope is that Valhalla will give us
>> tools to help us manage that kind of complexity - but even then,
>> some of the optimizations (e.g. scalarization) might be gated by
>> excessive polymorphism and/or lack of inlining. If we can improve
>> the VM enough to do the type profiling we need to keep unsafe
>> access sharp even in the face of a "monomorphic" implementation,
>> then I believe the current API could take advantage of Valhalla in
>> a more straightforward fashion (and we could, in the future, add
>> Valhalla optimizations to special-case the treatment of sealed
>> interfaces whose only implementation is a primitive class).
>>
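>> As a sketch of that shape (using a record as a stand-in for a
>> future primitive class; the names are made up):
>>
>> ```
>> // A sealed interface with exactly one implementation: uses of the
>> // interface are effectively monomorphic, which the VM can exploit.
>> sealed interface Address permits AddressImpl {}
>> record AddressImpl(long rawValue) implements Address {}
>> ```
>>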
>> [Btw, this discussion is really about MemorySegment - for
>> MemoryAddress, in my own experiments I could already see Valhalla
>> making quick work of all the address instantiations, as
>> MemoryAddressImpl is the only implementation of MemoryAddress].
>>
>> Maurizio
>>
>>
>> On 20/12/2021 07:21, Quân Anh Mai wrote:
>> > Hi,
>> >
>> > Currently, we can only access MemorySegments and MemoryAddresses
>> > through the respective interfaces. While this provides a nice
>> > interface for all kinds of memory segments, the lack of ability
>> > to use the concrete types leads to a lot of performance caveats.
>> >
>> > Firstly, polymorphism disables scalarization. While a
>> > non-escaped object can be scalarized in most cases, there are
>> > still circumstances where scalar replacement fails (e.g. when we
>> > continuously slice a segment in a loop - see the sketch after
>> > this paragraph). Furthermore, this makes us dependent on the
>> > inlining ability of the compiler, which is unpredictable and
>> > limits the use of segments and addresses where performance
>> > matters. On the other hand, scalarization of polymorphic types
>> > in fields and in the calling convention seems to be really
>> > complicated. With primitive classes, we could make the
>> > performance of the foreign API much more predictable, with the
>> > elimination of allocations as well as pointer chasing where we
>> > can and want to limit the kind of segment we operate on.
>> >
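>> > A sketch of the slicing-in-a-loop pattern mentioned above
>> > (illustrative only; each iteration creates a fresh segment,
>> > which scalar replacement may fail to eliminate):
>> >
>> > var segment = MemorySegment.ofArray(new byte[64]);
>> > long sum = 0;
>> > while (segment.byteSize() >= Integer.BYTES) {
>> >     sum += segment.get(ValueLayout.JAVA_INT, 0);
>> >     segment = segment.asSlice(Integer.BYTES); // new slice each time
>> > }
>> >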
>> > The above caveats lead to a possible usage of the foreign API
>> > where naked addresses are passed around as long values, and
>> > segments are only constructed where needed. This approach,
>> > besides being an ugly hack, is still not ideal because multiple
>> > methods may fail to be inlined.
>> >
>> > Secondly, polymorphism limits specialisation. With JEP 218, we
>> > may have multiple specialisations of the same methods operating
>> > on different kinds of segments. While it is still possible, to
>> > some extent, to have specialisation with the polymorphic
>> > MemorySegment type, it would likely be a fragile optimisation
>> > that relies on inlining and a lot of type checks.
>> >
>> > Furthermore, while they share common aspects, the different
>> > kinds of MemorySegment expose different behaviours: e.g.
>> > HeapMemorySegment is not Addressable, and MappedMemorySegment
>> > has various additional specific methods. While this is not an
>> > argument about the design of the foreign API per se, it is a
>> > small bonus point on top of those above.
>> >
>> > Overall, the current state of the foreign API seems to put us in
>> > a position that relies too much on the compiler to get the
>> > desired performance. Exposing the concrete types would enable us
>> > to write more predictable code where that is needed, and
>> > flexible code (i.e. using the polymorphic MemorySegment,
>> > MemoryAddress, etc.) where that is more desirable.
>> >
>> > My apologies if this question has been addressed before. Thank
>> > you very much.
>> > Quan Anh
>>