[foreign] Some benchmarks for Foreign vs JNA (vs pure java implementation)

Jorn Vernee jbvernee at xs4all.nl
Mon Mar 11 16:26:11 UTC 2019


For 4. we already have an Array::toArray API, but this creates a new 
array (maybe that's enough here?). We could easily add an overload that 
allows a user to pass in a pre-existing Java array.

The underlying implementation in BoundedArray [1] doesn't look optimized 
to do bulk copies currently, but I guess it could be for cases where the 
array element type has the exact same layout as the native type.
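As a rough illustration of what such an overload could look like, here is a mock (`NativeDoubleArray` is a hypothetical name, and a direct `DoubleBuffer` stands in for native memory — this is not the real Panama Array implementation):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Mock of the proposed API shape. A direct DoubleBuffer stands in for
// native memory; this is NOT the real Panama Array implementation.
public class NativeDoubleArray {
    private final DoubleBuffer data;

    public NativeDoubleArray(double[] initial) {
        data = ByteBuffer.allocateDirect(initial.length * Double.BYTES)
                         .order(ByteOrder.nativeOrder())
                         .asDoubleBuffer();
        data.put(initial);
    }

    // Existing style: allocates a fresh Java array on every call.
    public double[] toArray() {
        return toArray(new double[data.capacity()]);
    }

    // Proposed overload: the caller supplies a pre-existing destination,
    // so repeated transfers create no garbage.
    public double[] toArray(double[] dest) {
        DoubleBuffer view = data.duplicate();
        view.rewind();
        view.get(dest, 0, Math.min(dest.length, view.remaining())); // one bulk copy
        return dest;
    }

    public static void main(String[] args) {
        NativeDoubleArray arr = new NativeDoubleArray(new double[]{1, 2, 3});
        double[] scratch = new double[3];
        arr.toArray(scratch);  // reuses scratch, no allocation
        System.out.println(java.util.Arrays.toString(scratch)); // prints [1.0, 2.0, 3.0]
    }
}
```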

Jorn

[1] : 
http://hg.openjdk.java.net/panama/dev/file/4aaccf806020/src/java.base/share/classes/jdk/internal/foreign/memory/BoundedArray.java#l66

Maurizio Cimadamore wrote on 2019-03-11 15:17:
> Hi Lev,
> thanks for the interesting data points. The fact that native calls are
> not faster than JNI-based ones is to be expected at this stage, given
> that our backend is essentially JNI-based; so the results in the
> FFT-only case make sense.
> 
> As for the array copy issue, I have few comments:
> 
> 1) I suspect the result you get might vary with the benchmark - in
> this case, of course, you are getting hit badly, because you want to go
> from a Java array to a native array and back - so the Panama code does
> a lot more computation than the other examples, which unsurprisingly
> shows up in the benchmark.
> 
> 2) Where Array starts to pay off is in cases where e.g. you read an
> array from a native struct and then pass a pointer to it to a
> native call; I suspect that in these cases the fact that a JNA-like
> framework needs to materialize the Java array eagerly might result
> in the opposite benchmark results (because there's an extra garbage
> array being created, which Panama does not create).
> 
> 3) You seem to make the assumption that the natural way to encode
> inputs/outputs (see the FFTState class) is using Java arrays. Maybe an
> actual Panama-based implementation will encode something like FFTState
> as an interface, with accessors to get at the result [i][j] - in a
> way that _no array at all is ever materialized_. This should even be
> better than JNA - where you have to have some backing Java array for
> the output data (which requires GC activity). In other words, the main
> point of the Panama array is to avoid materializing a true array - and
> to expose access to the underlying data lazily. Of course there are
> cases where people might just want to turn these into regular Java
> arrays, and perhaps these cases should be made more efficient (see
> below) - but it seems to me that forcing an impl to work in terms of
> Java arrays is missing half the point of using a Panama array in the
> first place.
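A minimal sketch of what such an interface-based encoding could look like. `FftState`, `re()` and `im()` are hypothetical names, and a direct `DoubleBuffer` stands in for the native memory a binding would own:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

// Sketch of point (3): expose the transform output through accessors
// instead of materializing a Java array. FftState, re() and im() are
// hypothetical names; the direct DoubleBuffer stands in for native memory.
public class FftStateDemo {

    interface FftState {
        double re(int i);   // real part of output element i, read lazily
        double im(int i);   // imaginary part of output element i
    }

    // View interleaved complex data (re0, im0, re1, im1, ...) lazily:
    // no Java array is ever created on the read path.
    static FftState over(DoubleBuffer interleaved) {
        return new FftState() {
            public double re(int i) { return interleaved.get(2 * i); }
            public double im(int i) { return interleaved.get(2 * i + 1); }
        };
    }

    public static void main(String[] args) {
        DoubleBuffer nativeOut = ByteBuffer.allocateDirect(4 * Double.BYTES)
                                           .order(ByteOrder.nativeOrder())
                                           .asDoubleBuffer();
        nativeOut.put(new double[]{1.0, -1.0, 2.0, -2.0});

        FftState state = over(nativeOut);
        System.out.println(state.re(1) + " " + state.im(1)); // prints 2.0 -2.0
    }
}
```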
> 
> 4) Internally, Panama supports creating Pointer and even Array
> instances that are based 'on heap'. It is not hard, in principle, to
> expose an API to create an Array view out of a heap double[] - and then
> do a bulk copy (Array::assign) between the on-heap and the off-heap
> version (which will surely be faster than what you are doing now, and
> probably similar to using a DoubleBuffer).
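The kind of bulk copy suggested in (4) can be sketched today with plain NIO. This is only an analogue of the proposed heap-view-plus-`Array::assign` path, not the Panama API itself:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

public class BulkAssign {
    // Copy a heap array into off-heap memory and back with bulk transfers.
    static double[] roundTrip(double[] onHeap) {
        // Off-heap storage, standing in for a native FFTW input array.
        DoubleBuffer offHeap = ByteBuffer
                .allocateDirect(onHeap.length * Double.BYTES)
                .order(ByteOrder.nativeOrder())
                .asDoubleBuffer();

        // Heap -> native: one bulk transfer, analogous to Array::assign.
        offHeap.put(DoubleBuffer.wrap(onHeap)).rewind();

        // ... a native transform would run here, writing into offHeap ...

        // Native -> heap: again a single bulk copy, no per-element loop.
        double[] back = new double[onHeap.length];
        offHeap.get(back);
        return back;
    }

    public static void main(String[] args) {
        double[] out = roundTrip(new double[]{1, 2, 3, 4});
        System.out.println(java.util.Arrays.toString(out)); // prints [1.0, 2.0, 3.0, 4.0]
    }
}
```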
> 
> 
> So, I tend to view the glass as half-full - surely there are things we
> should work more on, and offering short-term migration paths like (4)
> would probably be a good place to start.
> 
> Maurizio
> 
> 
> On 11/03/2019 13:32, Lev Serebryakov wrote:
>>   TL;DR VERSION: Panama/foreign is not faster than JNA now.
>> 
>> Full version:
>> 
>>   I've performed some comparison between the Foreign interface and JNA
>> (which is JNI under the bonnet). Also, I've thrown a pure-Java
>> implementation of the same task into the mix, but as the algorithms
>> are not exactly the same, it is more for reference.
>> 
>>   First, I want to describe what I measure. If you know what FFT is
>> and how the FFTW library works, you can skip this part.
>> 
>> == What is FFT?
>> 
>>   Fast Fourier Transform, or FFT for short, is a family of algorithms
>> to calculate the Discrete Fourier Transform, which is used widely in
>> many areas, such as digital signal processing, imaging, and others.
>> The main parameter of this algorithm is the size of the transform. The
>> size defines the amount of data needed for one execution of the
>> algorithm, and the frequency resolution you can achieve.
>> 
>>    Theoretically, these algorithms have O(n*log(n)) complexity in the
>> size of the transform (a naive implementation of the Discrete Fourier
>> Transform is O(n^2)). But real-life implementations have many
>> subtleties, and the constant (which is not accounted for by
>> O-notation) can vary widely, depending both on the transform size and
>> on implementation details. The most straightforward "textbook"
>> implementations work well only with power-of-2 sizes, for example.
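For reference, the naive O(n^2) definition that FFT algorithms speed up can be written in a few lines (illustration only — no real FFT library computes it this way):

```java
// Naive O(n^2) DFT: X[k] = sum over t of x[t] * e^(-2*pi*i*k*t/N).
// FFT algorithms compute the same result in O(n*log(n)).
public class NaiveDft {
    // Input and output are interleaved complex arrays: re0, im0, re1, im1, ...
    static double[] dft(double[] x) {
        int n = x.length / 2;
        double[] out = new double[2 * n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                double xr = x[2 * t], xi = x[2 * t + 1];
                // Complex multiply-accumulate: x[t] * e^(i*angle)
                re += xr * Math.cos(angle) - xi * Math.sin(angle);
                im += xr * Math.sin(angle) + xi * Math.cos(angle);
            }
            out[2 * k] = re;
            out[2 * k + 1] = im;
        }
        return out;
    }

    public static void main(String[] args) {
        // The DFT of a unit impulse is flat: every output bin is (1, 0).
        double[] impulse = {1, 0, 0, 0, 0, 0, 0, 0}; // 4 complex points
        System.out.println(java.util.Arrays.toString(dft(impulse)));
    }
}
```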
>> 
>> == What is FFTW3, which is used in this benchmark?
>> 
>>   The best general-purpose implementation of the FFT algorithms is the
>> "FFTW3" (http://www.fftw.org/) library. It is a highly-optimized and
>> very sophisticated library, written in pure C. It uses a very clever
>> method to decompose the work at hand, and its API works like this:
>> 
>>    (1) You ask the library to create a "transform plan", and pass all
>> the details about the transform you need: its size (the main
>> parameter, of course), the input and output arrays (yes, at the
>> planning stage) and some flags, like whether the library may destroy
>> the data in the input array or should use a temporary array as a
>> working memory area. This "planning" stage can take some time, as
>> large transform sizes can be decomposed into sub-transforms in a
>> multitude of ways, and the library does some internal benchmarking to
>> select the best one. This operation needs to be done only once for a
>> given transform size and input-output configuration.
>> 
>>   (2) You call the transform with the created plan multiple times on
>> different data. Here there are two possibilities:
>>     (a) You use the same input and output arrays that were used at
>> plan creation time. This is the best way, according to the FFTW
>> documentation.
>>     (b) You pass new input and output arrays for each execution. These
>> arrays must be aligned the same way as the arrays passed to the
>> planning stage!
>> 
>> == JTransforms
>> 
>>   JTransforms is a pure-Java FFT library which is well-optimised but
>> not as "state-of-the-art" as FFTW. It has a much simpler planner, it
>> cannot have vectorized algorithms, and it depends on HotSpot heavily.
>> But it works on native Java `double[]` arrays and doesn't need any
>> data preparation to use.
>> 
>>   I've added JTransforms to the benchmark to have some baseline, as it
>> has no call overhead, and at small transform sizes call overhead
>> dominates. It is interesting to see when the more efficient native
>> code overcomes the Java-to-native call overhead.
>> 
>> == What do I measure?
>> 
>>   I want to measure "real-world" calculation of FFTs of different
>> sizes in Java. That means I want to have the input and output data in
>> Java-native `double[]` arrays, as I emulate Java calculations and
>> assume that the other data processing is pure Java, so the data needs
>> to be in Java-native data structures.
>> 
>>   I measure two versions: "in-place" (input and output are the same
>> array) and "out-of-place" (input and output arrays are different ones)
>> transforms.
>> 
>>   I measure only the simplest, power-of-2 sizes, from 2^4 (16) to 2^18
>> (131072) complex numbers, so there are twice as many `double` elements
>> in the arrays.
>> 
>>   Also, I measure pure FFT speed, for reference. What does that mean?
>> Pure FFT speed doesn't include the transfer of data from the
>> binding-dependent structure to the native Java array and back. Some
>> bindings can avoid this transfer (which is pure overhead), and for
>> them "Pure FFT" and "Full Calculation" are the same!
>> 
>>   I do NOT measure plan creation!
>> 
>> == Benchmarks implemented
>> 
>>    (1) JNAAllocated — this is a JNA-created binding, which uses
>> off-heap allocated java.nio `DoubleBuffer`s. It copies data in and out
>> of Java arrays with the `DoubleBuffer` API.
>> 
>>    (2) JNAWrapperd — this is a JNA-created binding, which uses
>> `DoubleBuffer`-wrapped Java-native arrays and the "execute with new
>> buffers" FFTW3 API to allow using Java-native arrays in native code.
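The wrapped variant relies on `DoubleBuffer.wrap` creating a view that shares storage with the Java array. A minimal sketch of that behaviour, with no JNA or FFTW involved:

```java
import java.nio.DoubleBuffer;

public class WrapDemo {
    public static void main(String[] args) {
        double[] signal = new double[4];

        // wrap() creates a view over the same storage: no copy is made,
        // and the resulting buffer is heap-based, not direct.
        DoubleBuffer view = DoubleBuffer.wrap(signal);
        System.out.println(view.isDirect());   // prints false

        // Writes through the view are visible in the array immediately,
        // which is what lets a binding expose the same data to native
        // code without an explicit transfer step in Java.
        view.put(0, 42.0);
        System.out.println(signal[0]);         // prints 42.0
    }
}
```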
>> 
>>    (3) JTransforms — this is a simple pure-Java implementation with
>> the JTransforms library. As JTransforms always transforms in-place, it
>> needs to copy the input data to keep the input array intact.
>> 
>>    (4) Panama — this is a Panama/jextract-created binding, which uses
>> Panama `Array<>` with the Panama allocator and such. It needs to copy
>> data in and out before and after the transform.
>> 
>>    You can see all the benchmarks and the framework here:
>> 
>> https://github.com/blacklion/panama-benchmarks
>> 
>>   (I plan to add more benchmarks in the future for different Panama
>> sub-projects.)
>> 
>> == Results.
>> 
>>    All results were collected on panama-jdk-13-44 on Windows 10, on an
>> i5-4570, with a single thread.
>> 
>>    Full results can be seen in this Google sheet:
>> https://docs.google.com/spreadsheets/d/1-7O16o-38yFIVx-LWTIpVxRnXCvdvs4NmNXLThT2KHI/edit?usp=sharing
>> 
>>   Please note there are 3 sheets: the first one is the plain CSV
>> report from JMH, the second one is some analysis of the out-of-place
>> tasks and the third one is the same analysis for the in-place tasks.
>> 
>>   Please also note that the charts are effectively in log-log scale.
>> 
>>   Some summary of results:
>> 
>>    (1) Data copying to and from Panama arrays is enormously expensive
>> and dominates the "full" execution time for any size. It makes the
>> Panama implementation always slower than any other (including pure
>> Java).
>> 
>>    (2) Pure FFT calls cost the same for JNA and Panama.
>> 
>>    (3) JNA with new (wrapped) arrays is slower than JNA with off-heap
>> buffers for pure FFT. This needs to be investigated further; I think
>> it is due to array alignment and the selection (or absence) of
>> vectorized code paths by FFTW. But at sizes larger than ~2048 the full
>> speeds become equal.
>> 
>>   (4) Pure-Java JTransforms for this simple case (power-of-2
>> transforms) is not much worse for large sizes and much better for
>> small sizes. Data copies in/out and native calls are still very, very
>> expensive.
>> 