Speed optimization of Spliterators.spliteratorUnknownSize for parallel case

Tue Jul 14 02:43:17 UTC 2015

Hello!

Thank you for the detailed answer.

PS> Thanks for looking at this. Judging by the results i am guessing
PS> your measurements were performed on a 4-core system.

Yes, quad-core, I mentioned it before.

PS> My initial inclination for these scenarios is it is likely better
PS> for the developer to split the execution in two parts. The first
PS> part, low N and low Q, is to collect sequentially into an array or
PS> list. The second part (total size is known), low N and high Q,
PS> would then use that array/list as the stream source and operate in parallel.

Probably it would be fine to add such recommendations to the
corresponding methods javaDoc (for example, BufferedReader.lines).

PS> That, in a sense, is almost what your spliterator does for n <
PS> initial batch size -:) but because the target threshold size is
PS> derived from the unknown root size (Long.MAX_VALUE) the
PS> sub-spliterators are not themselves further split, so the
PS> parallelism is better but still limited. It's tricky to adjust the
PS> target size dynamically, we could do it for the first split (e.g.
PS> unknown size -> two known size), but tracking further splits is
PS> harder and may not reliably perform (due to concurrent
PS> sub-splitting). I generally don't want to complicate the
PS> computation mechanism for poor splitting use-cases.

I see. That actually was also the thing which I wanted to suggest.
Still even improving the first split would be great, I'd vote for it.
This way AbstractSpliterator user implementations may also benefit.
My optimization cannot be applied to AbstractSpliterator as we cannot
control the tryAdvance method.

PS> Your approach also degenerates when N is just over the current batch size.

Yes, it becomes bad, but my measurement show that it's still no worse
than current implementation:

Benchmark                (parallel)  (size)  Mode  Cnt    Score    Error  Units
IteratorTest.base             false    1028  avgt   30  659.308 ± 12.618  us/op
IteratorTest.base             false    1100  avgt   30  746.619 ± 21.260  us/op
IteratorTest.base              true    1028  avgt   30  194.800 ±  1.988  us/op
IteratorTest.base              true    1100  avgt   30  228.926 ±  2.923  us/op
IteratorTest.jdkSpliterator   false    1028  avgt   30  648.622 ±  3.857  us/op
IteratorTest.jdkSpliterator   false    1100  avgt   30  750.741 ± 10.279  us/op
IteratorTest.jdkSpliterator    true    1028  avgt   30  673.574 ±  6.469  us/op
IteratorTest.jdkSpliterator    true    1100  avgt   30  679.499 ±  4.310  us/op
IteratorTest.optimized        false    1028  avgt   30  643.209 ±  1.686  us/op
IteratorTest.optimized        false    1100  avgt   30  718.077 ±  8.123  us/op
IteratorTest.optimized         true    1028  avgt   30  674.447 ±  5.153  us/op
IteratorTest.optimized         true    1100  avgt   30  674.622 ±  4.252  us/op

PS> It does not make as much sense to include such an estimate
PS> argument for Files.lines for two reasons:

PS> 1) I recently improved Files.lines for common character sets
PS> where it is efficient to identify encoded line feeds (such as UTF-8); and

PS> 2) for other character sets we could derive an estimate from the
PS> file size (it could actually be the file size, which is the case for 1).

I see. FileChannelLinesSpliterator is a nice improvement. Have you
tested it on device files, named pipes or cases when the file size
is changed in-between by some other process? Also probably it should
be mentioned in BufferedReader.lines documentation, that using
Files.lines for file source is preferred.

PS> So i am inclined to do the following:

PS> 1) consider adding a size estimate argument for all stream
PS> factories that are of unknown size and where no reasonable estimate can be derived.

PS> 2) modify the spliterator factories in Spliterators  to accept an estimate e.g.:

PS> public static <T> Spliterator<T>
PS> spliteratorUnknownSize(Iterator<? extends T> iterator,
PS>                                                         long sizeEstimate,
PS>                                                         int characteristics)

By the way currently it's possible to create an IteratorSpliterator
with estimated size:

Spliterators.spliterator(iterator, estimatedSize, Spliterator.CONCURRENT);

Of course it's a misuse of CONCURRENT flag, but it's not used
otherwisely by stream API, thus currently it has no unwanted
side-effects.

With best regards,
Tagir Valeev.

PS> 3) Possibly improve the final split logic of implementations in
PS> Spliterators so that it benefits known size, estimated size [*], and maybe unknown size.

PS> Paul.

PS> [*] if the known or estimated size remaining is less than the
PS> next batch size, then split in half as the final splitting action.
PS> In this case there is no need for an array field since the iterator can be reused.

PS> On Jul 10, 2015, at 8:01 PM, Tagir F. Valeev <amaembo at gmail.com> wrote:

>> Hello!
>> 
>> The problem of ineffective parallelization of spliteratorUnknownSize
>> for low-N/high-Q tasks was already discussed in the thread "Exploiting
>> concurrency with IteratorSpliterator of unknown size"
>> [ http://mail.openjdk.java.net/pipermail/lambda-dev/2014-March/011968.html ]
>> It was proposed to modify some heuristics and/or add some parameters
>> to control them.
>> 
>> I have a simpler idea which allows to run in parallel high-Q tasks for
>> N <~ 3000 about twice faster without any API changes while having no
>> visible impact for high-N or sequential cases. Here's the gist with
>> proof-of-concept implementation (UnknownSizeSpliterator.java), JMH
>> benchmark (IteratorTest.java) and benchmarking results (performed
>> on Quad-Core i5-3340, Win7 64bit):
>> https://gist.github.com/amaembo/781c64a3b4f48fdeb196
>> 
>> The problem which I noticed is the following. When trySplit is
>> requested and JDK IteratorSpliterator fills the buffer, it may
>> suddenly discover that the source iterator has no more elements. At
>> this point the IteratorSpliterator splits in very uneven manner.
>> Namely it dumps everything into the ArraySpliterator leaving no
>> elements for itself. As it did not report the size previously, the
>> pipeline engine assumes that both parts are roughly equal which
>> result in very poor performance.
>> 
>> I merged both IteratorSpliterator and ArraySpliterator into single
>> class UnknownSizeSpliterator. When it sees that input spliterator has
>> no more elements, it converts itself into array-mode and splits right
>> here to two equal parts, thus the first call of trySplit for N < 1024
>> performs actual split.
>> 
>> Note that I did not change the heuristics at all, thus for big-N
>> tasks the results should be the same. My implementation is a little
>> bit simplified (no additional characteristics, no precondition
>> checks, no primitive implementations), but I guess current version is
>> ok for preliminary testing to decide whether it's good or bad.
>> 
>> This implementation might improve the performance of existing methods
>> like BufferedReader.lines, Files.lines, Pattern.splitAsStream when
>> using the parallel stream as well as user-defined iterator-based
>> spliterators. Also it maybe useful for JDK-8072727 feature
>> [ https://bugs.openjdk.java.net/browse/JDK-8072727 ].
>> 
>> What do you think?
>> 
>> With best regards,
>> Tagir Valeev.
>>