Alternative Spliterator for SpinedBuffer

Peter Levart peter.levart at gmail.com
Thu Apr 18 07:39:20 PDT 2013


Hi Brian,

I did some performance measurements. The most obvious difference in 
performance is observed when individual operations in the stream 
pipeline take relatively long time, but even with fast individual 
operations, the performance benefit can be observed. For example, this 
test code:

public class StreamBuilderTest {

     static long test(boolean parallel) {
         StreamBuilder.OfInt builder = Streams.intBuilder();
         for (int i = 0; i < 10000000; i++) {
             builder.accept(i);
         }

         try {
             System.gc();
             Thread.sleep(500L);
             System.gc();
             Thread.sleep(500L);
             System.gc();
             Thread.sleep(500L);
         } catch (InterruptedException e) {}

         long t0 = System.nanoTime();
         double pi = (parallel ? builder.build().parallel() : builder.build().sequential())
                         .mapToDouble(i ->
                             (i & 1) == 0
                             ? 4d / ((2d * i + 2d) * (2d * i + 3d) * (2d * i + 4d))
                             : -4d / ((2d * i + 2d) * (2d * i + 3d) * (2d * i + 4d))
                         )
                         .sum() + 3d;
         long t = System.nanoTime() - t0;
         System.out.println("  pi=" + pi + " in " + t + " ns");
         return t;
     }

     static void testN(int n, boolean parallel) {
         System.out.println("  warm-up");
         for (int i = 0; i < 5; i++) test(parallel);
         System.out.println("  measure");
         long tSum = 0;
         for (int i = 0; i < n; i++) tSum += test(parallel);
         System.out.println("  average: " + ((double) tSum / n) + " ns");
     }

     public static void main(String[] args) {
         System.out.println("Sequential:");
         testN(10, false);
         System.out.println("Parallel:");
         testN(10, true);
     }
}
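For reference, the series being summed in the mapToDouble step is the Nilakantha series for pi (even i contributes a positive term, odd i a negative one), which is why the result converges to 3.14159...:

```latex
\pi = 3 + \frac{4}{2\cdot 3\cdot 4} - \frac{4}{4\cdot 5\cdot 6} + \frac{4}{6\cdot 7\cdot 8} - \frac{4}{8\cdot 9\cdot 10} + \cdots
```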


Prints the following with the current Spliterator implementation:

Sequential:
   warm-up
   pi=3.1415926535897913 in 115127248 ns
   pi=3.1415926535897913 in 70426262 ns
   pi=3.1415926535897913 in 74633777 ns
   pi=3.1415926535897913 in 70558263 ns
   pi=3.1415926535897913 in 68995773 ns
   measure
   pi=3.1415926535897913 in 67116015 ns
   pi=3.1415926535897913 in 72918687 ns
   pi=3.1415926535897913 in 68584475 ns
   pi=3.1415926535897913 in 67871743 ns
   pi=3.1415926535897913 in 72419475 ns
   pi=3.1415926535897913 in 67932912 ns
   pi=3.1415926535897913 in 71237839 ns
   pi=3.1415926535897913 in 69636412 ns
   pi=3.1415926535897913 in 66849404 ns
   pi=3.1415926535897913 in 72220404 ns
   average: 6.96787366E7 ns
Parallel:
   warm-up
   pi=3.141592653589793 in 27772347 ns
   pi=3.141592653589793 in 18209373 ns
   pi=3.141592653589793 in 18516459 ns
   pi=3.141592653589793 in 16655371 ns
   pi=3.141592653589793 in 19956267 ns
   measure
   pi=3.141592653589793 in 21184246 ns
   pi=3.141592653589793 in 20352169 ns
   pi=3.141592653589793 in 21195477 ns
   pi=3.141592653589793 in 21287557 ns
   pi=3.141592653589793 in 16230650 ns
   pi=3.141592653589793 in 20035680 ns
   pi=3.141592653589793 in 18355295 ns
   pi=3.141592653589793 in 21971399 ns
   pi=3.141592653589793 in 20552600 ns
   pi=3.141592653589793 in 19332368 ns
   average: 2.00497441E7 ns


And the following with the alternative Spliterator:

Sequential:
   warm-up
   pi=3.1415926535897913 in 111301836 ns
   pi=3.1415926535897913 in 73685822 ns
   pi=3.1415926535897913 in 68454448 ns
   pi=3.1415926535897913 in 71583982 ns
   pi=3.1415926535897913 in 68397094 ns
   measure
   pi=3.1415926535897913 in 67047559 ns
   pi=3.1415926535897913 in 67424431 ns
   pi=3.1415926535897913 in 67079863 ns
   pi=3.1415926535897913 in 73486488 ns
   pi=3.1415926535897913 in 76055108 ns
   pi=3.1415926535897913 in 74366550 ns
   pi=3.1415926535897913 in 69808517 ns
   pi=3.1415926535897913 in 65960535 ns
   pi=3.1415926535897913 in 65969289 ns
   pi=3.1415926535897913 in 70446736 ns
   average: 6.97645076E7 ns
Parallel:
   warm-up
   pi=3.1415926535897913 in 21529026 ns
   pi=3.1415926535897913 in 15567913 ns
   pi=3.1415926535897913 in 15090407 ns
   pi=3.1415926535897913 in 15071715 ns
   pi=3.1415926535897913 in 14593676 ns
   measure
   pi=3.1415926535897913 in 15227210 ns
   pi=3.1415926535897913 in 15406588 ns
   pi=3.1415926535897913 in 14631733 ns
   pi=3.1415926535897913 in 15719352 ns
   pi=3.1415926535897913 in 15201443 ns
   pi=3.1415926535897913 in 16537214 ns
   pi=3.1415926535897913 in 15365526 ns
   pi=3.1415926535897913 in 14674482 ns
   pi=3.1415926535897913 in 19898638 ns
   pi=3.1415926535897913 in 16947804 ns
   average: 1.5960999E7 ns


But don't take these measurements for granted. You can try it for 
yourself. I'm sure you have lots of performance tests that involve the 
SpinedBuffer spliterator. Here's the updated patch, which also 
incorporates the primitive variants:

http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.02/index.html

I made it so that the original/alternative implementation can be 
switched with a system property, so the same test can easily be run 
with both implementations to compare results.
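The switching mechanism might look roughly like the following sketch. The property name used here is made up for illustration (the webrev has the actual one); only the pattern matters:

```java
// Sketch of a system-property switch between two spliterator implementations.
// The property name "spinedbuffer.altSpliterator" is hypothetical.
public class SpliteratorSwitch {

    // Read once at class initialization, like other java.util.stream tunables,
    // so the choice is fixed for the lifetime of the JVM.
    private static final boolean USE_ALT_SPLITERATOR =
            Boolean.getBoolean("spinedbuffer.altSpliterator");

    static String spliteratorKind() {
        return USE_ALT_SPLITERATOR ? "alternative" : "original";
    }

    public static void main(String[] args) {
        // Run with -Dspinedbuffer.altSpliterator=true to flip the switch.
        System.out.println("Using " + spliteratorKind() + " spliterator");
    }
}
```

Reading the property once at class-init time (rather than per call) keeps the check free on the hot path, at the cost of not being switchable at runtime.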


Regards, Peter



On 04/12/2013 07:10 PM, Brian Goetz wrote:
> Thanks for taking a look at this code!  Your approach does look like 
> it would yield a better decomposition -- and this is a really good 
> time to be looking at this, as the APIs have mostly settled.
>
> I agree that there are improvements to be made here.  Have your 
> performance measurements shown improved runtime as well as reduced 
> task count?
>
> On 4/12/2013 11:10 AM, Peter Levart wrote:
>> On 04/11/2013 09:53 PM, brian.goetz at oracle.com wrote:
>>> Changeset: 5e7f8797b9c1
>>> Author:    briangoetz
>>> Date:      2013-04-11 15:53 -0400
>>> URL: http://hg.openjdk.java.net/lambda/lambda/jdk/rev/5e7f8797b9c1
>>>
>>> Spec pass on StreamBuilder
>>>
>>> ! src/share/classes/java/util/stream/StreamBuilder.java
>>> ! src/share/classes/java/util/stream/Streams.java
>>>
>>>
>>
>> Hi,
>>
>> I played a little with StreamBuilder and it works quite well using a
>> private SpinedBuffer under the hood. The only sub-optimality arises
>> when it starts small (minimum 16 elements) and then grows big, and is
>> afterwards used in a parallel computation where the maximum leaf-task
>> size is more than 16 elements. For example, this code:
>>
>>
>> public class SpinedBufferTest {
>>
>>      static int dump(Spliterator<Integer> spliterator, int level, int iterateSize) {
>>          for (int i = 0; i < level; i++) System.out.print(' '); // indent by split depth
>>          System.out.print(spliterator.estimateSize());
>>          Spliterator<Integer> spl2;
>>          if (spliterator.estimateSize() > iterateSize && (spl2 = spliterator.trySplit()) != null) {
>>              System.out.println(": split " + spl2.estimateSize() + " + " + spliterator.estimateSize());
>>              return dump(spl2, level + 1, iterateSize) +
>>                     dump(spliterator, level + 1, iterateSize);
>>          }
>>          else {
>>              Integer[] minMax = new Integer[2];
>>              spliterator.forEachRemaining(x -> {
>>                  minMax[0] = minMax[0] == null ? x : Math.min(minMax[0], x);
>>                  minMax[1] = minMax[1] == null ? x : Math.max(minMax[1], x);
>>              });
>>              System.out.println(": (" + minMax[0] + "..." + minMax[1] + ")");
>>              return 1;
>>          }
>>      }
>>
>>      public static void main(String[] args) {
>>          SpinedBuffer<Integer> buf = new SpinedBuffer<>();
>>          for (int i = 0; i < 16384; i++) {
>>              buf.accept(i);
>>          }
>>          int leafs = dump(buf.spliterator(), 0, 1024);
>>          System.out.println("\nTotal leafs: " + leafs);
>>      }
>> }
>>
>>
>> Prints the splits that would occur for 16k elements when the maximum
>> leaf-task size is 1k:
>>
>>
>> 16384: split 16 + 16368
>>   16: (0...15)
>>   16368: split 16 + 16352
>>    16: (16...31)
>>    16352: split 32 + 16320
>>     32: (32...63)
>>     16320: split 64 + 16256
>>      64: (64...127)
>>      16256: split 128 + 16128
>>       128: (128...255)
>>       16128: split 256 + 15872
>>        256: (256...511)
>>        15872: split 512 + 15360
>>         512: (512...1023)
>>         15360: split 1024 + 14336
>>          1024: (1024...2047)
>>          14336: split 2048 + 12288
>>           2048: split 1024 + 1024
>>            1024: (2048...3071)
>>            1024: (3072...4095)
>>           12288: split 4096 + 8192
>>            4096: split 2048 + 2048
>>             2048: split 1024 + 1024
>>              1024: (4096...5119)
>>              1024: (5120...6143)
>>             2048: split 1024 + 1024
>>              1024: (6144...7167)
>>              1024: (7168...8191)
>>            8192: split 4096 + 4096
>>             4096: split 2048 + 2048
>>              2048: split 1024 + 1024
>>               1024: (8192...9215)
>>               1024: (9216...10239)
>>              2048: split 1024 + 1024
>>               1024: (10240...11263)
>>               1024: (11264...12287)
>>             4096: split 2048 + 2048
>>              2048: split 1024 + 1024
>>               1024: (12288...13311)
>>               1024: (13312...14335)
>>              2048: split 1024 + 1024
>>               1024: (14336...15359)
>>               1024: (15360...16383)
>>
>> Total leafs: 22
>>
>>
>> 7 leaf tasks are half the maximum size or less.
>>
>> Now, SpinedBuffer has chunks whose sizes increase exponentially in
>> powers of 2: 16, 16, 32, 64, 128, 256, 512, ... Each chunk has a
>> capacity equal to the sum of the capacities of all previous chunks.
>> So a nearly optimal split (modulo the last, non-full chunk) is to
>> always split at the last chunk's boundary: the returned Spliterator
>> contains all the chunks before the last, and this Spliterator retains
>> only the last chunk.
>>
>> Here's my attempt at modifying the spliterator to act like that (in
>> this first try I only did the generic Object variant, but I can
>> prepare the primitive variants too, if desired):
>>
>> http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.01/index.html 
>>
>>
>>
>>
>> With this modification, the above code prints:
>>
>> 16384: split 8192 + 8192
>>   8192: split 4096 + 4096
>>    4096: split 2048 + 2048
>>     2048: split 1024 + 1024
>>      1024: (0...1023)
>>      1024: (1024...2047)
>>     2048: split 1024 + 1024
>>      1024: (2048...3071)
>>      1024: (3072...4095)
>>    4096: split 2048 + 2048
>>     2048: split 1024 + 1024
>>      1024: (4096...5119)
>>      1024: (5120...6143)
>>     2048: split 1024 + 1024
>>      1024: (6144...7167)
>>      1024: (7168...8191)
>>   8192: split 4096 + 4096
>>    4096: split 2048 + 2048
>>     2048: split 1024 + 1024
>>      1024: (8192...9215)
>>      1024: (9216...10239)
>>     2048: split 1024 + 1024
>>      1024: (10240...11263)
>>      1024: (11264...12287)
>>    4096: split 2048 + 2048
>>     2048: split 1024 + 1024
>>      1024: (12288...13311)
>>      1024: (13312...14335)
>>     2048: split 1024 + 1024
>>      1024: (14336...15359)
>>      1024: (15360...16383)
>>
>> Total leafs: 16
>>
>>
>> Regards, Peter
>>
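For readers skimming the archive: the doubling chunk layout and the "split off all chunks before the last" strategy that Peter describes can be sketched standalone. The class and helper below are purely illustrative (not the actual SpinedBuffer code), under the stated assumption that capacities run 16, 16, 32, 64, ...:

```java
import java.util.ArrayList;
import java.util.List;

// Standalone illustration of the chunk layout described above: capacities
// 16, 16, 32, 64, ..., so each chunk's capacity equals the sum of the
// capacities of all chunks before it.
public class ChunkSplitSketch {

    // Chunk capacities needed to hold totalElements, mirroring the doubling scheme.
    static List<Integer> chunkCapacities(int totalElements) {
        List<Integer> caps = new ArrayList<>();
        int cap = 16, sum = 0;
        while (sum < totalElements) {
            caps.add(cap);
            sum += cap;
            if (caps.size() >= 2) cap *= 2; // the first two chunks are both 16
        }
        return caps;
    }

    public static void main(String[] args) {
        List<Integer> caps = chunkCapacities(16384);
        // Verify the invariant: capacity of chunk i == sum of capacities 0..i-1.
        int prefix = caps.get(0);
        for (int i = 1; i < caps.size(); i++) {
            if (caps.get(i) != prefix) throw new AssertionError("broken at chunk " + i);
            prefix += caps.get(i);
        }
        // Because of this invariant, splitting off everything before the last
        // full chunk divides the buffered elements exactly in half, which is
        // why the alternative spliterator decomposes a full buffer so evenly.
        System.out.println("chunks=" + caps + ", total=" + prefix);
    }
}
```

The invariant is exactly why the alternative split produces the balanced 8192 + 8192 top-level split shown in the second dump, instead of the lopsided 16 + 16368 split of the original.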


