Speed optimization of Spliterators.spliteratorUnknownSize for parallel case

Mon Jul 13 12:10:42 UTC 2015

Hi Tagir,

Thanks for looking at this. Judging by the results i am guessing your measurements were performed on a 4-core system.

My initial inclination for these scenarios is it is likely better for the developer to split the execution in two parts. The first part, low N and low Q, is to collect sequentially into an array or list. The second part (total size is known), low N and high Q, would then use that array/list as the stream source and operate in parallel.

That, in a sense, is almost what your spliterator does for n < initial batch size -:) but because the target threshold size is derived from the unknown root size (Long.MAX_VALUE) the sub-spliterators are not themselves further split, so the parallelism is better but still limited. It's tricky to adjust the target size dynamically, we could do it for the first split (e.g. unknown size -> two known size), but tracking further splits is harder and may not reliably perform (due to concurrent sub-splitting). I generally don't want to complicate the computation mechanism for poor splitting use-cases.

Your approach also degenerates when N is just over the current batch size.

I like what you have done, but i am hesitating. While we can marginally improve splitting from a sequential source use-cases they are nearly always gonna behave poorly so it feels like putting "lipstick on a pig" :-) Having said that the splitting logic is a little dumb returning null when all elements are consumed, for simplification purposes given that splitting is always poor and it does not matter so much if the size or a good estimate is known. For some additional implementation complexity there are marginal gains to be had, so it can seem like a reasonable thing to do. Just not sure it's enough.

I think larger gains may likely be had if it were possible to pass size estimates to certain stream source methods whose size is unknown e.g. passing in an estimate of the number of expected lines in BufferedReader.lines() to potentially achieve results closer to your base measurement case.

It does not make as much sense to include such an estimate argument for Files.lines for two reasons:

1) I recently improved Files.lines for common character sets where it is efficient to identify encoded line feeds (such as UTF-8); and

2) for other character sets we could derive an estimate from the file size (it could actually be the file size, which is the case for 1).

So i am inclined to do the following:

1) consider adding a size estimate argument for all stream factories that are of unknown size and where no reasonable estimate can be derived.

2) modify the spliterator factories in Spliterators  to accept an estimate e.g.:

public static <T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator,
                                                        long sizeEstimate,
                                                        int characteristics)

3) Possibly improve the final split logic of implementations in Spliterators so that it benefits known size, estimated size [*], and maybe unknown size.

Paul.

[*] if the known or estimated size remaining is less than the next batch size, then split in half as the final splitting action. In this case there is no need for an array field since the iterator can be reused.

On Jul 10, 2015, at 8:01 PM, Tagir F. Valeev <amaembo at gmail.com> wrote:

> Hello!
> 
> The problem of ineffective parallelization of spliteratorUnknownSize
> for low-N/high-Q tasks was already discussed in the thread "Exploiting
> concurrency with IteratorSpliterator of unknown size"
> [ http://mail.openjdk.java.net/pipermail/lambda-dev/2014-March/011968.html ]
> It was proposed to modify some heuristics and/or add some parameters
> to control them.
> 
> I have a simpler idea which allows to run in parallel high-Q tasks for
> N <~ 3000 about twice faster without any API changes while having no
> visible impact for high-N or sequential cases. Here's the gist with
> proof-of-concept implementation (UnknownSizeSpliterator.java), JMH
> benchmark (IteratorTest.java) and benchmarking results (performed
> on Quad-Core i5-3340, Win7 64bit):
> https://gist.github.com/amaembo/781c64a3b4f48fdeb196
> 
> The problem which I noticed is the following. When trySplit is
> requested and JDK IteratorSpliterator fills the buffer, it may
> suddenly discover that the source iterator has no more elements. At
> this point the IteratorSpliterator splits in very uneven manner.
> Namely it dumps everything into the ArraySpliterator leaving no
> elements for itself. As it did not report the size previously, the
> pipeline engine assumes that both parts are roughly equal which
> result in very poor performance.
> 
> I merged both IteratorSpliterator and ArraySpliterator into single
> class UnknownSizeSpliterator. When it sees that input spliterator has
> no more elements, it converts itself into array-mode and splits right
> here to two equal parts, thus the first call of trySplit for N < 1024
> performs actual split.
> 
> Note that I did not change the heuristics at all, thus for big-N
> tasks the results should be the same. My implementation is a little
> bit simplified (no additional characteristics, no precondition
> checks, no primitive implementations), but I guess current version is
> ok for preliminary testing to decide whether it's good or bad.
> 
> This implementation might improve the performance of existing methods
> like BufferedReader.lines, Files.lines, Pattern.splitAsStream when
> using the parallel stream as well as user-defined iterator-based
> spliterators. Also it maybe useful for JDK-8072727 feature
> [ https://bugs.openjdk.java.net/browse/JDK-8072727 ].
> 
> What do you think?
> 
> With best regards,
> Tagir Valeev.
>