[foreign-memaccess+abi] RFR: 8264933: Improve stream support in memory segments

Rémi Forax github.com+828220+forax at openjdk.java.net
Fri Apr 9 12:28:26 UTC 2021


On Fri, 9 Apr 2021 12:05:33 GMT, Maurizio Cimadamore <mcimadamore at openjdk.org> wrote:

>> I fail to see how a Spliterator that asks a stream to do as many recursive calls / create as many sub-segments as there are elements is a good idea?
> 
> This is something that has been present since the second iteration of the API. And, it can be used quite effectively. We have benchmarks (see ParallelSum) which use fork join recursive actions to do a parallel sum of the contents of a segment. Provided that the "size of the element" is chosen appropriately, the speedup obtained is rather nice:
> 
> Benchmark                   Mode  Cnt   Score   Error  Units
> ParallelSum.segment_serial  avgt   30  86.004 ± 0.941  ms/op
> 
> vs:
> 
> Benchmark                                 Mode  Cnt   Score   Error  Units
> ParallelSum.segment_stream_parallel       avgt   30  45.211 ± 1.105  ms/op
> ParallelSum.segment_stream_parallel_bulk  avgt   30  23.057 ± 0.353  ms/op
> 
> Here we compare a serial sum of all the elements in a segment (with a flat for loop) against a parallel sum using parallel streams; in the first parallel benchmark, we use a split size that is the same as the element size (e.g. 4 bytes); while this improves throughput, it is not ideal, as too many intermediate segments are created. But that's why the spliterator/stream methods accept an element layout: you can also specify a "bulk element" (e.g. 1024 ints each) and then process these in parallel. As you can see, the speedup increases by another 2x by doing this.
> 
> Considering how little code is needed to write this:
> 
>     @Benchmark
>     public int segment_stream_parallel_bulk() {
>         return segment.parallelStream(ELEM_LAYOUT_BULK).mapToInt(SEGMENT_TO_INT_BULK).sum();
>     }
> 
> I think this makes sense; of course it's not a silver bullet, and has to be handled with care, but here we assume the audience knows what they are doing.
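> 
> For context, the bulk layout and mapper constants used above could look something like this (a sketch; the actual definitions in the benchmark may differ):
> 
>     static final MemoryLayout ELEM_LAYOUT_BULK =
>             MemoryLayout.ofSequence(1024, MemoryLayouts.JAVA_INT); // one "bulk" element = 1024 ints
> 
>     static final ToIntFunction<MemorySegment> SEGMENT_TO_INT_BULK = slice -> {
>         int sum = 0;
>         for (long i = 0; i < 1024; i++) {
>             sum += MemoryAccess.getIntAtIndex(slice, i); // sum the ints in this 1024-int slice
>         }
>         return sum;
>     };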
> 
> Apart from parallel processing, turning a segment into a stream of slices is also useful to perform ad-hoc marshalling/unmarshalling - as written in the panama-dev email:
> 
> segment.stream(C_POINTER)
>        .map(MemoryAccess::getAddress)
>        .map(CLinker::toJavaString)
>        .toArray(String[]::new);
>  
> 
> In this case, we're not after performance - we just want to express more directly what would otherwise be expressed using a big for loop.

So not using the "bulk" version leads to lower performance, but because it's a Stream of MemorySegment, you as a user have to know whether you are using the bulk version or not.

I think I would prefer the stream to be a stream of addresses, a LongStream instead of a Stream of MemorySegment, so your last example can be written:
  segment.addresses(C_POINTER)
       .mapToObj(addr -> CLinker.toJavaString(MemoryAddress.ofLong(addr)))
       .toArray(String[]::new);
Here, the element layout is only there to provide its size.

Having a stream of offsets/addresses would also work better with VarHandles, for example:
  segment.addresses(AN_INT)
       .mapToInt(offset -> (int) HANDLE.get(segment, offset))
       .sum();
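
Such an addresses/offsets method could be sketched on top of the existing API, purely for illustration (the name and shape here are just my sketch, not an actual API):

  // hypothetical sketch: a LongStream of element offsets, stepping by the layout's byte size
  static LongStream offsets(MemorySegment segment, MemoryLayout elementLayout) {
      long step = elementLayout.byteSize();
      return LongStream.range(0, segment.byteSize() / step)
                       .map(i -> i * step); // offset of each element in the segment
  }

Calling .parallel() on such a stream would still allow splitting the work, without materializing an intermediate sub-segment per element.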

-------------

PR: https://git.openjdk.java.net/panama-foreign/pull/494

