Alternative Spliterator for SpinedBuffer

Mon Jun 10 07:58:31 PDT 2013

Hi Peter,

Finally got around to processing this, sorry for the delay.

I integrated the patch into the latest code base and made a couple of tweaks here and there:

  http://hg.openjdk.java.net/lambda/lambda/jdk/rev/8b5895496c63

--

I have also prepared a patch for tl, since this plus some minor changes is mostly isolated from some other implementation stuff:

  http://cr.openjdk.java.net/~psandoz/tl/JDK-8016251-spinedBuffer-splitr/webrev/

Paul.

On Apr 18, 2013, at 4:39 PM, Peter Levart <peter.levart at gmail.com> wrote:

> Hi Brian,
> 
> I did some performance measurements. The most obvious difference in 
> performance is observed when individual operations in the stream 
> pipeline take relatively long time, but even with fast individual 
> operations, the performance benefit can be observed. For example, this 
> test code:
> 
> public class StreamBuilderTest {
> 
>     static long test(boolean parallel) {
>         StreamBuilder.OfInt builder = Streams.intBuilder();
>         for (int i = 0; i < 10000000; i++) {
>             builder.accept(i);
>         }
> 
>         try {
>             System.gc();
>             Thread.sleep(500L);
>             System.gc();
>             Thread.sleep(500L);
>             System.gc();
>             Thread.sleep(500L);
>         } catch (InterruptedException e) {}
> 
>         long t0 = System.nanoTime();
>         double pi = (parallel ? builder.build().parallel() : 
> builder.build().sequential())
>                         .mapToDouble(i ->
>                             (i & 1) == 0
>                             ? 4d / ((2d * i + 2d) * (2d * i + 3d) * (2d 
> * i + 4d))
>                             : -4d / ((2d * i + 2d) * (2d * i + 3d) * 
> (2d * i + 4d))
>                         )
>                         .sum() + 3d;
>         long t = System.nanoTime() - t0;
>         System.out.println("  pi=" + pi + " in " + t + " ns");
>         return t;
>     }
> 
>     static void testN(int n, boolean parallel) {
>         System.out.println("  warm-up");
>         for (int i = 0; i < 5; i++) test(parallel);
>         System.out.println("  measure");
>         long tSum = 0;
>         for (int i = 0; i < n; i++) tSum += test(parallel);
>         System.out.println("  average: " + ((double) tSum / n) + " ns");
>     }
> 
>     public static void main(String[] args) {
>         System.out.println("Sequential:");
>         testN(10, false);
>         System.out.println("Parallel:");
>         testN(10, true);
>     }
> }
> 
> 
> Prints the following with current Spliterator implementation:
> 
> Sequential:
>   warm-up
>   pi=3.1415926535897913 in 115127248 ns
>   pi=3.1415926535897913 in 70426262 ns
>   pi=3.1415926535897913 in 74633777 ns
>   pi=3.1415926535897913 in 70558263 ns
>   pi=3.1415926535897913 in 68995773 ns
>   measure
>   pi=3.1415926535897913 in 67116015 ns
>   pi=3.1415926535897913 in 72918687 ns
>   pi=3.1415926535897913 in 68584475 ns
>   pi=3.1415926535897913 in 67871743 ns
>   pi=3.1415926535897913 in 72419475 ns
>   pi=3.1415926535897913 in 67932912 ns
>   pi=3.1415926535897913 in 71237839 ns
>   pi=3.1415926535897913 in 69636412 ns
>   pi=3.1415926535897913 in 66849404 ns
>   pi=3.1415926535897913 in 72220404 ns
>   average: 6.96787366E7 ns
> Parallel:
>   warm-up
>   pi=3.141592653589793 in 27772347 ns
>   pi=3.141592653589793 in 18209373 ns
>   pi=3.141592653589793 in 18516459 ns
>   pi=3.141592653589793 in 16655371 ns
>   pi=3.141592653589793 in 19956267 ns
>   measure
>   pi=3.141592653589793 in 21184246 ns
>   pi=3.141592653589793 in 20352169 ns
>   pi=3.141592653589793 in 21195477 ns
>   pi=3.141592653589793 in 21287557 ns
>   pi=3.141592653589793 in 16230650 ns
>   pi=3.141592653589793 in 20035680 ns
>   pi=3.141592653589793 in 18355295 ns
>   pi=3.141592653589793 in 21971399 ns
>   pi=3.141592653589793 in 20552600 ns
>   pi=3.141592653589793 in 19332368 ns
>   average: 2.00497441E7 ns
> 
> 
> And the following with alternative Spliterator:
> 
> Sequential:
>   warm-up
>   pi=3.1415926535897913 in 111301836 ns
>   pi=3.1415926535897913 in 73685822 ns
>   pi=3.1415926535897913 in 68454448 ns
>   pi=3.1415926535897913 in 71583982 ns
>   pi=3.1415926535897913 in 68397094 ns
>   measure
>   pi=3.1415926535897913 in 67047559 ns
>   pi=3.1415926535897913 in 67424431 ns
>   pi=3.1415926535897913 in 67079863 ns
>   pi=3.1415926535897913 in 73486488 ns
>   pi=3.1415926535897913 in 76055108 ns
>   pi=3.1415926535897913 in 74366550 ns
>   pi=3.1415926535897913 in 69808517 ns
>   pi=3.1415926535897913 in 65960535 ns
>   pi=3.1415926535897913 in 65969289 ns
>   pi=3.1415926535897913 in 70446736 ns
>   average: 6.97645076E7 ns
> Parallel:
>   warm-up
>   pi=3.1415926535897913 in 21529026 ns
>   pi=3.1415926535897913 in 15567913 ns
>   pi=3.1415926535897913 in 15090407 ns
>   pi=3.1415926535897913 in 15071715 ns
>   pi=3.1415926535897913 in 14593676 ns
>   measure
>   pi=3.1415926535897913 in 15227210 ns
>   pi=3.1415926535897913 in 15406588 ns
>   pi=3.1415926535897913 in 14631733 ns
>   pi=3.1415926535897913 in 15719352 ns
>   pi=3.1415926535897913 in 15201443 ns
>   pi=3.1415926535897913 in 16537214 ns
>   pi=3.1415926535897913 in 15365526 ns
>   pi=3.1415926535897913 in 14674482 ns
>   pi=3.1415926535897913 in 19898638 ns
>   pi=3.1415926535897913 in 16947804 ns
>   average: 1.5960999E7 ns
> 
> 
> But don't take this measurements for granted. You can try it for 
> yourself. I'm sure you have lots of performance tests that involve 
> SpinedBuffer spliterator. Here's the updated patch that also 
> incorporates primitive variants:
> 
> http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.02/index.html
> 
> I made it so that original/alternative implementation can be switched 
> with a system property so that the same test can be run with both 
> implementations easily to compare results.
> 
> 
> Regards, Peter
> 
> 
> 
> On 04/12/2013 07:10 PM, Brian Goetz wrote:
>> Thanks for taking a look at this code!  Your approach does look like 
>> it would yield a better decomposition -- and this is a really good 
>> time to be looking at this, as the APIs have mostly settled.
>> 
>> I agree that there are improvements to be made here.  Have your 
>> performance measurements as improved runtime as well as reduced task 
>> count?
>> 
>> On 4/12/2013 11:10 AM, Peter Levart wrote:
>>> On 04/11/2013 09:53 PM, brian.goetz at oracle.com wrote:
>>>> Changeset: 5e7f8797b9c1
>>>> Author:    briangoetz
>>>> Date:      2013-04-11 15:53 -0400
>>>> URL: http://hg.openjdk.java.net/lambda/lambda/jdk/rev/5e7f8797b9c1
>>>> 
>>>> Spec pass on StreamBuilder
>>>> 
>>>> ! src/share/classes/java/util/stream/StreamBuilder.java
>>>> ! src/share/classes/java/util/stream/Streams.java
>>>> 
>>>> 
>>> 
>>> Hi,
>>> 
>>> I played a little with StreamBuilder and it works quite optimally using
>>> private SpinedBuffer under the hood. The only sub-optimality is when it
>>> starts small (min. 16 elements) and then grows big and after that it is
>>> used in a parallel computation where leaf task maximum size is more than
>>> 16 elements. For example, this code:
>>> 
>>> 
>>> public class SpinedBufferTest {
>>> 
>>>     static int dump(Spliterator<Integer> spliterator, int level, int
>>> iterateSize) {
>>> System.out.print(" ".substring(0, level));
>>>         System.out.print(spliterator.estimateSize());
>>>         Spliterator<Integer> spl2;
>>>         if (spliterator.estimateSize() > iterateSize && (spl2 =
>>> spliterator.trySplit()) != null) {
>>>             System.out.println(": split " + spl2.estimateSize() + " + "
>>> + spliterator.estimateSize());
>>>             return dump(spl2, level + 1, iterateSize) +
>>>                    dump(spliterator, level + 1, iterateSize);
>>>         }
>>>         else {
>>>             Integer[] minMax = new Integer[2];
>>>             spliterator.forEachRemaining(x -> {
>>>                 minMax[0] = minMax[0] == null ? x : Math.min(minMax[0],
>>> x);
>>>                 minMax[1] = minMax[1] == null ? x : Math.max(minMax[1],
>>> x);
>>>             });
>>>             System.out.println(": (" + minMax[0] + "..." + minMax[1] +
>>> ")");
>>>             return 1;
>>>         }
>>>     }
>>> 
>>>     public static void main(String[] args) {
>>>         SpinedBuffer<Integer> buf = new SpinedBuffer<>();
>>>         for (int i = 0; i < 16384; i++) {
>>>             buf.accept(i);
>>>         }
>>>         int leafs = dump(buf.spliterator(), 0, 1024);
>>>         System.out.println("\nTotal leafs: " + leafs);
>>>     }
>>> }
>>> 
>>> 
>>> Prints the splits that would occur for 16k elements when max. leaf task
>>> size is 1k:
>>> 
>>> 
>>> 16384: split 16 + 16368
>>>  16: (0...15)
>>>  16368: split 16 + 16352
>>>   16: (16...31)
>>>   16352: split 32 + 16320
>>>    32: (32...63)
>>>    16320: split 64 + 16256
>>>     64: (64...127)
>>>     16256: split 128 + 16128
>>>      128: (128...255)
>>>      16128: split 256 + 15872
>>>       256: (256...511)
>>>       15872: split 512 + 15360
>>>        512: (512...1023)
>>>        15360: split 1024 + 14336
>>>         1024: (1024...2047)
>>>         14336: split 2048 + 12288
>>>          2048: split 1024 + 1024
>>>           1024: (2048...3071)
>>>           1024: (3072...4095)
>>>          12288: split 4096 + 8192
>>>           4096: split 2048 + 2048
>>>            2048: split 1024 + 1024
>>>             1024: (4096...5119)
>>>             1024: (5120...6143)
>>>            2048: split 1024 + 1024
>>>             1024: (6144...7167)
>>>             1024: (7168...8191)
>>>           8192: split 4096 + 4096
>>>            4096: split 2048 + 2048
>>>             2048: split 1024 + 1024
>>>              1024: (8192...9215)
>>>              1024: (9216...10239)
>>>             2048: split 1024 + 1024
>>>              1024: (10240...11263)
>>>              1024: (11264...12287)
>>>            4096: split 2048 + 2048
>>>             2048: split 1024 + 1024
>>>              1024: (12288...13311)
>>>              1024: (13312...14335)
>>>             2048: split 1024 + 1024
>>>              1024: (14336...15359)
>>>              1024: (15360...16383)
>>> 
>>> Total leafs: 22
>>> 
>>> 
>>> 7 leaf tasks are half the max. size or less.
>>> 
>>> Now SpinedBuffer has chunks with sizes increasing exponentially using
>>> power of 2:  16, 16, 32, 64, 128, 256, 512, ... Each chunk has capacity
>>> equal to the sum of capacities of all previous chunks. So nearly optimal
>>> split (modulo last non-full chunk) would be to always split so that the
>>> last chunk is cut-off and the returned Spliterator contains all the
>>> chunks up to the last and this Spliterator only contains the last chunk.
>>> 
>>> Here's my attempt at modifying the spliterator to act like that (I only
>>> did it for the generic Object variant in 1st try, but I can prepare the
>>> primitive variants too, if desired):
>>> 
>>> http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.01/index.html 
>>> 
>>> 
>>> 
>>> 
>>> With this modification, the above code prints:
>>> 
>>> 16384: split 8192 + 8192
>>>  8192: split 4096 + 4096
>>>   4096: split 2048 + 2048
>>>    2048: split 1024 + 1024
>>>     1024: (0...1023)
>>>     1024: (1024...2047)
>>>    2048: split 1024 + 1024
>>>     1024: (2048...3071)
>>>     1024: (3072...4095)
>>>   4096: split 2048 + 2048
>>>    2048: split 1024 + 1024
>>>     1024: (4096...5119)
>>>     1024: (5120...6143)
>>>    2048: split 1024 + 1024
>>>     1024: (6144...7167)
>>>     1024: (7168...8191)
>>>  8192: split 4096 + 4096
>>>   4096: split 2048 + 2048
>>>    2048: split 1024 + 1024
>>>     1024: (8192...9215)
>>>     1024: (9216...10239)
>>>    2048: split 1024 + 1024
>>>     1024: (10240...11263)
>>>     1024: (11264...12287)
>>>   4096: split 2048 + 2048
>>>    2048: split 1024 + 1024
>>>     1024: (12288...13311)
>>>     1024: (13312...14335)
>>>    2048: split 1024 + 1024
>>>     1024: (14336...15359)
>>>     1024: (15360...16383)
>>> 
>>> Total leafs: 16
>>> 
>>> 
>>> Regards, Peter
>>> 
> 
>