Alternative Spliterator for SpinedBuffer
Paul Sandoz
paul.sandoz at oracle.com
Mon Jun 10 07:58:31 PDT 2013
Hi Peter,
Finally got around to processing this, sorry for the delay.
I integrated the patch into the latest code base and made a couple of tweaks here and there:
http://hg.openjdk.java.net/lambda/lambda/jdk/rev/8b5895496c63
--
I have also prepared a patch for tl, since this plus some minor changes is mostly isolated from some other implementation stuff:
http://cr.openjdk.java.net/~psandoz/tl/JDK-8016251-spinedBuffer-splitr/webrev/
Paul.
On Apr 18, 2013, at 4:39 PM, Peter Levart <peter.levart at gmail.com> wrote:
> Hi Brian,
>
> I did some performance measurements. The most obvious difference in
> performance is observed when individual operations in the stream
> pipeline take relatively long time, but even with fast individual
> operations, the performance benefit can be observed. For example, this
> test code:
>
> public class StreamBuilderTest {
>
> static long test(boolean parallel) {
> StreamBuilder.OfInt builder = Streams.intBuilder();
> for (int i = 0; i < 10000000; i++) {
> builder.accept(i);
> }
>
> try {
> System.gc();
> Thread.sleep(500L);
> System.gc();
> Thread.sleep(500L);
> System.gc();
> Thread.sleep(500L);
> } catch (InterruptedException e) {}
>
> long t0 = System.nanoTime();
> double pi = (parallel ? builder.build().parallel() :
> builder.build().sequential())
> .mapToDouble(i ->
> (i & 1) == 0
> ? 4d / ((2d * i + 2d) * (2d * i + 3d) * (2d
> * i + 4d))
> : -4d / ((2d * i + 2d) * (2d * i + 3d) *
> (2d * i + 4d))
> )
> .sum() + 3d;
> long t = System.nanoTime() - t0;
> System.out.println(" pi=" + pi + " in " + t + " ns");
> return t;
> }
>
> static void testN(int n, boolean parallel) {
> System.out.println(" warm-up");
> for (int i = 0; i < 5; i++) test(parallel);
> System.out.println(" measure");
> long tSum = 0;
> for (int i = 0; i < n; i++) tSum += test(parallel);
> System.out.println(" average: " + ((double) tSum / n) + " ns");
> }
>
> public static void main(String[] args) {
> System.out.println("Sequential:");
> testN(10, false);
> System.out.println("Parallel:");
> testN(10, true);
> }
> }
>
>
> Prints the following with current Spliterator implementation:
>
> Sequential:
> warm-up
> pi=3.1415926535897913 in 115127248 ns
> pi=3.1415926535897913 in 70426262 ns
> pi=3.1415926535897913 in 74633777 ns
> pi=3.1415926535897913 in 70558263 ns
> pi=3.1415926535897913 in 68995773 ns
> measure
> pi=3.1415926535897913 in 67116015 ns
> pi=3.1415926535897913 in 72918687 ns
> pi=3.1415926535897913 in 68584475 ns
> pi=3.1415926535897913 in 67871743 ns
> pi=3.1415926535897913 in 72419475 ns
> pi=3.1415926535897913 in 67932912 ns
> pi=3.1415926535897913 in 71237839 ns
> pi=3.1415926535897913 in 69636412 ns
> pi=3.1415926535897913 in 66849404 ns
> pi=3.1415926535897913 in 72220404 ns
> average: 6.96787366E7 ns
> Parallel:
> warm-up
> pi=3.141592653589793 in 27772347 ns
> pi=3.141592653589793 in 18209373 ns
> pi=3.141592653589793 in 18516459 ns
> pi=3.141592653589793 in 16655371 ns
> pi=3.141592653589793 in 19956267 ns
> measure
> pi=3.141592653589793 in 21184246 ns
> pi=3.141592653589793 in 20352169 ns
> pi=3.141592653589793 in 21195477 ns
> pi=3.141592653589793 in 21287557 ns
> pi=3.141592653589793 in 16230650 ns
> pi=3.141592653589793 in 20035680 ns
> pi=3.141592653589793 in 18355295 ns
> pi=3.141592653589793 in 21971399 ns
> pi=3.141592653589793 in 20552600 ns
> pi=3.141592653589793 in 19332368 ns
> average: 2.00497441E7 ns
>
>
> And the following with alternative Spliterator:
>
> Sequential:
> warm-up
> pi=3.1415926535897913 in 111301836 ns
> pi=3.1415926535897913 in 73685822 ns
> pi=3.1415926535897913 in 68454448 ns
> pi=3.1415926535897913 in 71583982 ns
> pi=3.1415926535897913 in 68397094 ns
> measure
> pi=3.1415926535897913 in 67047559 ns
> pi=3.1415926535897913 in 67424431 ns
> pi=3.1415926535897913 in 67079863 ns
> pi=3.1415926535897913 in 73486488 ns
> pi=3.1415926535897913 in 76055108 ns
> pi=3.1415926535897913 in 74366550 ns
> pi=3.1415926535897913 in 69808517 ns
> pi=3.1415926535897913 in 65960535 ns
> pi=3.1415926535897913 in 65969289 ns
> pi=3.1415926535897913 in 70446736 ns
> average: 6.97645076E7 ns
> Parallel:
> warm-up
> pi=3.1415926535897913 in 21529026 ns
> pi=3.1415926535897913 in 15567913 ns
> pi=3.1415926535897913 in 15090407 ns
> pi=3.1415926535897913 in 15071715 ns
> pi=3.1415926535897913 in 14593676 ns
> measure
> pi=3.1415926535897913 in 15227210 ns
> pi=3.1415926535897913 in 15406588 ns
> pi=3.1415926535897913 in 14631733 ns
> pi=3.1415926535897913 in 15719352 ns
> pi=3.1415926535897913 in 15201443 ns
> pi=3.1415926535897913 in 16537214 ns
> pi=3.1415926535897913 in 15365526 ns
> pi=3.1415926535897913 in 14674482 ns
> pi=3.1415926535897913 in 19898638 ns
> pi=3.1415926535897913 in 16947804 ns
> average: 1.5960999E7 ns
>
>
> But don't take this measurements for granted. You can try it for
> yourself. I'm sure you have lots of performance tests that involve
> SpinedBuffer spliterator. Here's the updated patch that also
> incorporates primitive variants:
>
> http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.02/index.html
>
> I made it so that original/alternative implementation can be switched
> with a system property so that the same test can be run with both
> implementations easily to compare results.
>
>
> Regards, Peter
>
>
>
> On 04/12/2013 07:10 PM, Brian Goetz wrote:
>> Thanks for taking a look at this code! Your approach does look like
>> it would yield a better decomposition -- and this is a really good
>> time to be looking at this, as the APIs have mostly settled.
>>
>> I agree that there are improvements to be made here. Have your
>> performance measurements as improved runtime as well as reduced task
>> count?
>>
>> On 4/12/2013 11:10 AM, Peter Levart wrote:
>>> On 04/11/2013 09:53 PM, brian.goetz at oracle.com wrote:
>>>> Changeset: 5e7f8797b9c1
>>>> Author: briangoetz
>>>> Date: 2013-04-11 15:53 -0400
>>>> URL: http://hg.openjdk.java.net/lambda/lambda/jdk/rev/5e7f8797b9c1
>>>>
>>>> Spec pass on StreamBuilder
>>>>
>>>> ! src/share/classes/java/util/stream/StreamBuilder.java
>>>> ! src/share/classes/java/util/stream/Streams.java
>>>>
>>>>
>>>
>>> Hi,
>>>
>>> I played a little with StreamBuilder and it works quite optimally using
>>> private SpinedBuffer under the hood. The only sub-optimality is when it
>>> starts small (min. 16 elements) and then grows big and after that it is
>>> used in a parallel computation where leaf task maximum size is more than
>>> 16 elements. For example, this code:
>>>
>>>
>>> public class SpinedBufferTest {
>>>
>>> static int dump(Spliterator<Integer> spliterator, int level, int
>>> iterateSize) {
>>> System.out.print(" ".substring(0, level));
>>> System.out.print(spliterator.estimateSize());
>>> Spliterator<Integer> spl2;
>>> if (spliterator.estimateSize() > iterateSize && (spl2 =
>>> spliterator.trySplit()) != null) {
>>> System.out.println(": split " + spl2.estimateSize() + " + "
>>> + spliterator.estimateSize());
>>> return dump(spl2, level + 1, iterateSize) +
>>> dump(spliterator, level + 1, iterateSize);
>>> }
>>> else {
>>> Integer[] minMax = new Integer[2];
>>> spliterator.forEachRemaining(x -> {
>>> minMax[0] = minMax[0] == null ? x : Math.min(minMax[0],
>>> x);
>>> minMax[1] = minMax[1] == null ? x : Math.max(minMax[1],
>>> x);
>>> });
>>> System.out.println(": (" + minMax[0] + "..." + minMax[1] +
>>> ")");
>>> return 1;
>>> }
>>> }
>>>
>>> public static void main(String[] args) {
>>> SpinedBuffer<Integer> buf = new SpinedBuffer<>();
>>> for (int i = 0; i < 16384; i++) {
>>> buf.accept(i);
>>> }
>>> int leafs = dump(buf.spliterator(), 0, 1024);
>>> System.out.println("\nTotal leafs: " + leafs);
>>> }
>>> }
>>>
>>>
>>> Prints the splits that would occur for 16k elements when max. leaf task
>>> size is 1k:
>>>
>>>
>>> 16384: split 16 + 16368
>>> 16: (0...15)
>>> 16368: split 16 + 16352
>>> 16: (16...31)
>>> 16352: split 32 + 16320
>>> 32: (32...63)
>>> 16320: split 64 + 16256
>>> 64: (64...127)
>>> 16256: split 128 + 16128
>>> 128: (128...255)
>>> 16128: split 256 + 15872
>>> 256: (256...511)
>>> 15872: split 512 + 15360
>>> 512: (512...1023)
>>> 15360: split 1024 + 14336
>>> 1024: (1024...2047)
>>> 14336: split 2048 + 12288
>>> 2048: split 1024 + 1024
>>> 1024: (2048...3071)
>>> 1024: (3072...4095)
>>> 12288: split 4096 + 8192
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (4096...5119)
>>> 1024: (5120...6143)
>>> 2048: split 1024 + 1024
>>> 1024: (6144...7167)
>>> 1024: (7168...8191)
>>> 8192: split 4096 + 4096
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (8192...9215)
>>> 1024: (9216...10239)
>>> 2048: split 1024 + 1024
>>> 1024: (10240...11263)
>>> 1024: (11264...12287)
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (12288...13311)
>>> 1024: (13312...14335)
>>> 2048: split 1024 + 1024
>>> 1024: (14336...15359)
>>> 1024: (15360...16383)
>>>
>>> Total leafs: 22
>>>
>>>
>>> 7 leaf tasks are half the max. size or less.
>>>
>>> Now SpinedBuffer has chunks with sizes increasing exponentially using
>>> power of 2: 16, 16, 32, 64, 128, 256, 512, ... Each chunk has capacity
>>> equal to the sum of capacities of all previous chunks. So nearly optimal
>>> split (modulo last non-full chunk) would be to always split so that the
>>> last chunk is cut-off and the returned Spliterator contains all the
>>> chunks up to the last and this Spliterator only contains the last chunk.
>>>
>>> Here's my attempt at modifying the spliterator to act like that (I only
>>> did it for the generic Object variant in 1st try, but I can prepare the
>>> primitive variants too, if desired):
>>>
>>> http://dl.dropboxusercontent.com/u/101777488/jdk8-lambda/SpinedBufferAltSpliterator/webrev.01/index.html
>>>
>>>
>>>
>>>
>>> With this modification, the above code prints:
>>>
>>> 16384: split 8192 + 8192
>>> 8192: split 4096 + 4096
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (0...1023)
>>> 1024: (1024...2047)
>>> 2048: split 1024 + 1024
>>> 1024: (2048...3071)
>>> 1024: (3072...4095)
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (4096...5119)
>>> 1024: (5120...6143)
>>> 2048: split 1024 + 1024
>>> 1024: (6144...7167)
>>> 1024: (7168...8191)
>>> 8192: split 4096 + 4096
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (8192...9215)
>>> 1024: (9216...10239)
>>> 2048: split 1024 + 1024
>>> 1024: (10240...11263)
>>> 1024: (11264...12287)
>>> 4096: split 2048 + 2048
>>> 2048: split 1024 + 1024
>>> 1024: (12288...13311)
>>> 1024: (13312...14335)
>>> 2048: split 1024 + 1024
>>> 1024: (14336...15359)
>>> 1024: (15360...16383)
>>>
>>> Total leafs: 16
>>>
>>>
>>> Regards, Peter
>>>
>
>
More information about the lambda-dev
mailing list