Request for tracking down C1 optimizations: handwritten Cartesian product has performance similar to flatMap/map!

Aggelos Biboudis biboudis at gmail.com
Fri May 30 12:48:17 UTC 2014


The quick thing to do was to produce the log, including the generated code:

http://cgi.di.uoa.gr/~biboudis/hotspot_pid5379.log

Here is what I know:

   - The execution happens inside copyInto.
   - We have three forEachRemaining calls (of, flatMap, map) that are
   delegated via the accept methods of the wrapped sink.
   - The second lambda, which is captured, is obtained via the method
   internalMemberName.
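The wrapped-sink delegation in the bullets above can be sketched roughly as follows. This is a simplified model, not the actual java.util.stream internals: the helper names (mapSink, flatMapSink) are illustrative, and the real Sink classes carry extra plumbing (begin/end, cancellation), but the accept-to-accept delegation is the same shape.

```java
import java.util.function.LongConsumer;
import java.util.function.LongFunction;
import java.util.function.LongUnaryOperator;
import java.util.stream.LongStream;

public class SinkChain {
    // map: transform each value, then delegate to the downstream sink
    static LongConsumer mapSink(LongUnaryOperator mapper, LongConsumer downstream) {
        return v -> downstream.accept(mapper.applyAsLong(v));
    }

    // flatMap: for each value, build an inner stream and push all of its
    // elements into the downstream sink
    static LongConsumer flatMapSink(LongFunction<LongStream> mapper, LongConsumer downstream) {
        return v -> mapper.apply(v).forEach(downstream);
    }

    public static void main(String[] args) {
        long[] sum = new long[1];
        LongConsumer terminal = v -> sum[0] += v;
        // shape of: of(hi).flatMap(d -> of(lo).map(dP -> dP * d)).sum()
        // note the inner lambda captures d, so it is the capturing lambda
        // discussed in this thread
        LongConsumer chain = flatMapSink(
                d -> LongStream.of(1, 2, 3).map(dP -> dP * d),
                terminal);
        LongStream.of(10, 20).forEach(chain);
        System.out.println(sum[0]); // (1+2+3)*10 + (1+2+3)*20 = 180
    }
}
```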

If scalar replacement happens, I would expect to see the captured lambda
spilled onto the stack and accessed accordingly when execution reaches a
reference to the mapper lambda inside flatMap. From what I understand,
this should happen in the accept of flatMap, where the inner lambda is
linked, and more specifically "inside" the mapper.apply of
LongPipeline:286, right (in terms of runtime execution)? And even more
specifically when accepting the captured lambda (lambda$8, to my
understanding).
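To make concrete what object is a candidate for scalar replacement here, below is a hand-desugared sketch of the inner lambda. The class name Multiplier is made up; javac's actual invokedynamic/LambdaMetafactory translation differs, but the key point carries over: because the lambda captures d, a fresh instance is materialized per outer element, and that per-element instance is exactly what escape analysis could scalar-replace if it proves the object never escapes.

```java
import java.util.function.LongUnaryOperator;
import java.util.stream.LongStream;

public class CapturedAllocation {
    // Hand-desugared equivalent of dP -> dP * d (illustrative name)
    static final class Multiplier implements LongUnaryOperator {
        final long d;                      // the captured value
        Multiplier(long d) { this.d = d; }
        @Override public long applyAsLong(long dP) { return dP * d; }
    }

    static long cart(long[] hi, long[] lo) {
        long sum = 0;
        for (long d : hi) {
            // one capturing-lambda allocation per outer element, plus the
            // inner LongStream and its Spliterator
            LongUnaryOperator mapper = new Multiplier(d);
            sum += LongStream.of(lo).map(mapper).sum();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(cart(new long[]{2, 3}, new long[]{4, 5})); // 45
    }
}
```

On a HotSpot debug or diagnostic build, -XX:+UnlockDiagnosticVMOptions -XX:+PrintEliminateAllocations is one way to check whether such allocations were in fact eliminated.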

On a side note, I'll produce a debug build of the VM to examine this
better. ;-) Thanks for the direction.

Aggelos.



On Fri, May 30, 2014 at 11:40 AM, Paul Sandoz <paul.sandoz at oracle.com>
wrote:

> On May 29, 2014, at 7:55 PM, Aggelos Biboudis <biboudis at gmail.com> wrote:
>
> Hello all,
>
> I would like to ask you something regarding C1 compilation (VM options:
> -Xms769m -Xmx769m -XX:-TieredCompilation) of a Cartesian product stream
> operation with the new stream API.
> I have two versions of this computation, one handwritten and one with
> flatmap/map. It is remarkable that these two have similar performance so I
> would like to trace-back the JIT compilation decisions (apart from
> inlining), and more specifically if escape analysis has any effect.
>
> ...
>>
>> valuesHi = IntStream.range(0, 10000).mapToLong(i -> i).toArray();
>> valuesLo = IntStream.range(0, 1000).mapToLong(i -> i).toArray();
>>
>>
> Tip: you can also use LongStream.range(0, N).toArray();
>
>
>> @GenerateMicroBenchmark // -> 4.984 ms / op on avg
>> public long cartSeq() {
>>    long cart
>>          = LongStream.of(valuesHi)
>>             .flatMap(d -> LongStream.of(valuesLo).map(dP -> dP * d))
>>
>
> The function of long -> LongStream passed to the flatMap operation will
> create two objects per call: an instance of LongStream and an instance of
> Spliterator. The map operation is passed a capturing lambda, so on each
> call it will create an instance of LongUnaryOperator.
>
> My hunch was that, unless something else is dominating, some form of
> scalar replacement is going on. However, is it reasonable to assume that
> the difference of 700us is due to the cost of object allocations? To be
> sure, one would need to look at the generated code.
>
> To factor out array reads you might want to measure:
>
>         long cart
>                 = LongStream.range(0, 10000)
>                 .flatMap(d -> LongStream.range(0, 1000).map(dP -> dP * d))
>                 .sum();
>
> Paul.
>
>
>
>             .sum();
>>    return cart;
>> }
>>
>> @GenerateMicroBenchmark // -> 4.258 ms / op on avg
>> public long cartBaseline() {
>>     long cart = 0;
>>     for (int d = 0; d < valuesHi.length; d++) {
>>         for (int dp = 0; dp < valuesLo.length; dp++) {
>>             cart += valuesHi[d] * valuesLo[dp];
>>         }
>>     }
>>     return cart;
>> }
>>
>
>


More information about the hotspot-compiler-dev mailing list