Speed optimization of Spliterators.spliteratorUnknownSize for parallel case
Hello!

The problem of ineffective parallelization of spliteratorUnknownSize for low-N/high-Q tasks was already discussed in the thread "Exploiting concurrency with IteratorSpliterator of unknown size" [ http://mail.openjdk.java.net/pipermail/lambda-dev/2014-March/011968.html ]. It was proposed to modify some heuristics and/or add some parameters to control them.

I have a simpler idea which allows high-Q tasks with N <~ 3000 to run in parallel about twice as fast, without any API changes and with no visible impact on high-N or sequential cases. Here's the gist with a proof-of-concept implementation (UnknownSizeSpliterator.java), a JMH benchmark (IteratorTest.java) and benchmarking results (performed on a Quad-Core i5-3340, Win7 64-bit): https://gist.github.com/amaembo/781c64a3b4f48fdeb196

The problem I noticed is the following. When trySplit is requested and the JDK IteratorSpliterator fills its buffer, it may suddenly discover that the source iterator has no more elements. At this point the IteratorSpliterator splits in a very uneven manner: it dumps everything into the ArraySpliterator, leaving no elements for itself. As it did not report the size previously, the pipeline engine assumes that both parts are roughly equal, which results in very poor performance.

I merged IteratorSpliterator and ArraySpliterator into a single class, UnknownSizeSpliterator. When it sees that the input iterator has no more elements, it converts itself into array mode and splits right there into two equal parts; thus for N < 1024 the first call of trySplit performs an actual split.

Note that I did not change the heuristics at all, so for big-N tasks the results should be the same. My implementation is somewhat simplified (no additional characteristics, no precondition checks, no primitive implementations), but I think the current version is fine for preliminary testing to decide whether the approach is good or bad.
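The idea can be sketched roughly as follows. This is an illustrative, heavily simplified reconstruction, not the actual patch: the real proof-of-concept in the gist also grows the batch size between splits and forwards source characteristics, which this sketch omits.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.function.Consumer;

// Simplified sketch: an iterator-backed spliterator that, on discovering
// during buffering that the source is exhausted, switches to array mode
// and keeps halving the buffer like an ArraySpliterator would.
class UnknownSizeSpliterator<T> implements Spliterator<T> {
    static final int BATCH = 1024;     // fixed batch size for this sketch

    private Iterator<? extends T> it;  // null once we are in array mode
    private Object[] array;            // buffer, valid in array mode
    private int index, fence;          // current range within the buffer

    UnknownSizeSpliterator(Iterator<? extends T> it) { this.it = it; }

    private UnknownSizeSpliterator(Object[] array, int index, int fence) {
        this.array = array; this.index = index; this.fence = fence;
    }

    @Override
    public Spliterator<T> trySplit() {
        if (it != null) {
            Object[] buf = new Object[BATCH];
            int n = 0;
            while (n < BATCH && it.hasNext())
                buf[n++] = it.next();
            if (it.hasNext()) {
                // Source not exhausted: hand the batch off, keep iterating.
                return new UnknownSizeSpliterator<>(buf, 0, n);
            }
            // Source exhausted: become an array spliterator over the buffer
            // and fall through to split it evenly, instead of dumping
            // everything into one child.
            it = null;
            array = buf; index = 0; fence = n;
        }
        int lo = index, mid = (lo + fence) >>> 1;
        return lo >= mid ? null
                         : new UnknownSizeSpliterator<>(array, lo, index = mid);
    }

    @Override
    @SuppressWarnings("unchecked")
    public boolean tryAdvance(Consumer<? super T> action) {
        if (it != null) {
            if (!it.hasNext()) return false;
            action.accept(it.next());
            return true;
        }
        if (index < fence) {
            action.accept((T) array[index++]);
            return true;
        }
        return false;
    }

    @Override
    public long estimateSize() {
        return it != null ? Long.MAX_VALUE : fence - index;
    }

    @Override
    public int characteristics() { return 0; }  // real version forwards them
}
```

The key difference from IteratorSpliterator is in trySplit: when the iterator turns out to be exhausted while filling the batch, this spliterator does not give the whole batch away, but converts to array mode and returns an even half.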
This implementation might improve the performance of existing methods like BufferedReader.lines, Files.lines and Pattern.splitAsStream when used with parallel streams, as well as user-defined iterator-based spliterators. It may also be useful for the JDK-8072727 feature [ https://bugs.openjdk.java.net/browse/JDK-8072727 ].

What do you think?

With best regards, Tagir Valeev.
Hi Tagir,

Thanks for looking at this. Judging by the results i am guessing your measurements were performed on a 4-core system.

My initial inclination for these scenarios is that it is likely better for the developer to split the execution into two parts. The first part, low N and low Q, is to collect sequentially into an array or list. The second part (total size is known), low N and high Q, would then use that array/list as the stream source and operate in parallel.

That, in a sense, is almost what your spliterator does for N < the initial batch size :-) but because the target threshold size is derived from the unknown root size (Long.MAX_VALUE) the sub-spliterators are not themselves further split, so the parallelism is better but still limited. It's tricky to adjust the target size dynamically; we could do it for the first split (e.g. unknown size -> two known sizes), but tracking further splits is harder and may not perform reliably (due to concurrent sub-splitting). I generally don't want to complicate the computation mechanism for poor-splitting use-cases. Your approach also degenerates when N is just over the current batch size.

I like what you have done, but i am hesitating. While we can marginally improve the splitting-from-a-sequential-source use-cases, they are nearly always going to behave poorly, so it feels like putting "lipstick on a pig" :-) Having said that, the splitting logic is a little dumb in returning null when all elements are consumed; that was done for simplicity, given that splitting is always poor and it does not matter so much when the size, or a good estimate, is known. For some additional implementation complexity there are marginal gains to be had, so it can seem like a reasonable thing to do. I am just not sure it's enough.

I think larger gains may be had if it were possible to pass size estimates to certain stream source methods whose size is unknown, e.g. passing in an estimate of the number of expected lines to BufferedReader.lines() to potentially achieve results closer to your base measurement case.

It does not make as much sense to include such an estimate argument for Files.lines, for two reasons: 1) I recently improved Files.lines for common character sets where it is efficient to identify encoded line feeds (such as UTF-8); and 2) for other character sets we could derive an estimate from the file size (it could actually be the file size, which is the case for 1).

So i am inclined to do the following:

1) Consider adding a size estimate argument for all stream factories that are of unknown size and where no reasonable estimate can be derived.

2) Modify the spliterator factories in Spliterators to accept an estimate, e.g.:

public static <T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator,
                                                        long sizeEstimate,
                                                        int characteristics)

3) Possibly improve the final split logic of the implementations in Spliterators so that it benefits known size, estimated size [*], and maybe unknown size.

Paul.

[*] If the known or estimated size remaining is less than the next batch size, then split in half as the final splitting action. In this case there is no need for an array field, since the iterator can be reused.

On Jul 10, 2015, at 8:01 PM, Tagir F. Valeev <amaembo@gmail.com> wrote:
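Paul's two-phase recommendation can be sketched as follows. This is a hypothetical example: the class and method names (`TwoPhase`, `expensiveTransform`, `processAll`) are illustrative, with `expensiveTransform` standing in for an arbitrary high-Q operation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.Collectors;

public class TwoPhase {
    // Stand-in for a high-Q (expensive per-element) operation.
    static String expensiveTransform(String s) {
        return s.toUpperCase();
    }

    // Phase 1 (low N, low Q): drain the unknown-size source sequentially
    // into a list. Phase 2 (size now known, high Q): run the expensive
    // work in parallel over the sized, evenly splitting list source.
    static List<String> processAll(Iterator<String> source) {
        List<String> buffered = new ArrayList<>();
        source.forEachRemaining(buffered::add);
        return buffered.parallelStream()
                       .map(TwoPhase::expensiveTransform)
                       .collect(Collectors.toList());
    }
}
```

Because the ArrayList reports an exact size, its spliterator splits evenly all the way down, so the parallel phase is not penalized by the unknown-size heuristics.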
Hello!

Thank you for the detailed answer.

PS> Thanks for looking at this. Judging by the results i am guessing
PS> your measurements were performed on a 4-core system.

Yes, quad-core, I mentioned it before.

PS> My initial inclination for these scenarios is it is likely better
PS> for the developer to split the execution in two parts. The first
PS> part, low N and low Q, is to collect sequentially into an array or
PS> list. The second part (total size is known), low N and high Q,
PS> would then use that array/list as the stream source and operate in parallel.

Probably it would be fine to add such recommendations to the Javadoc of the corresponding methods (for example, BufferedReader.lines).

PS> That, in a sense, is almost what your spliterator does for n <
PS> initial batch size :-) but because the target threshold size is
PS> derived from the unknown root size (Long.MAX_VALUE) the
PS> sub-spliterators are not themselves further split, so the
PS> parallelism is better but still limited. It's tricky to adjust the
PS> target size dynamically, we could do it for the first split (e.g.
PS> unknown size -> two known size), but tracking further splits is
PS> harder and may not reliably perform (due to concurrent
PS> sub-splitting). I generally don't want to complicate the
PS> computation mechanism for poor splitting use-cases.

I see. That was actually also the thing I wanted to suggest. Still, even improving just the first split would be great; I'd vote for it. This way user implementations based on AbstractSpliterator may also benefit. My optimization cannot be applied to AbstractSpliterator, as we cannot control the tryAdvance method.

PS> Your approach also degenerates when N is just over the current batch size.

Yes, it becomes bad, but my measurements show that it's still no worse than the current implementation:

Benchmark                    (parallel)  (size)  Mode  Cnt    Score    Error  Units
IteratorTest.base                 false    1028  avgt   30  659.308 ± 12.618  us/op
IteratorTest.base                 false    1100  avgt   30  746.619 ± 21.260  us/op
IteratorTest.base                  true    1028  avgt   30  194.800 ±  1.988  us/op
IteratorTest.base                  true    1100  avgt   30  228.926 ±  2.923  us/op
IteratorTest.jdkSpliterator       false    1028  avgt   30  648.622 ±  3.857  us/op
IteratorTest.jdkSpliterator       false    1100  avgt   30  750.741 ± 10.279  us/op
IteratorTest.jdkSpliterator        true    1028  avgt   30  673.574 ±  6.469  us/op
IteratorTest.jdkSpliterator        true    1100  avgt   30  679.499 ±  4.310  us/op
IteratorTest.optimized            false    1028  avgt   30  643.209 ±  1.686  us/op
IteratorTest.optimized            false    1100  avgt   30  718.077 ±  8.123  us/op
IteratorTest.optimized             true    1028  avgt   30  674.447 ±  5.153  us/op
IteratorTest.optimized             true    1100  avgt   30  674.622 ±  4.252  us/op

PS> It does not make as much sense to include such an estimate
PS> argument for Files.lines for two reasons:
PS> 1) I recently improved Files.lines for common character sets
PS> where it is efficient to identify encoded line feeds (such as UTF-8); and
PS> 2) for other character sets we could derive an estimate from the
PS> file size (it could actually be the file size, which is the case for 1).

I see. FileChannelLinesSpliterator is a nice improvement. Have you tested it on device files, named pipes, or cases where the file size is changed in between by some other process? Also, it should probably be mentioned in the BufferedReader.lines documentation that using Files.lines is preferred for a file source.

PS> So i am inclined to do the following:
PS> 1) consider adding a size estimate argument for all stream
PS> factories that are of unknown size and where no reasonable estimate can be derived.
PS> 2) modify the spliterator factories in Spliterators to accept an estimate e.g.:
PS> public static <T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator,
PS>                                                         long sizeEstimate,
PS>                                                         int characteristics)
PS> 3) Possibly improve the final split logic of implementations in
PS> Spliterators so that it benefits known size, estimated size [*], and maybe unknown size.
PS> Paul.
PS> [*] if the known or estimated size remaining is less than the
PS> next batch size, then split in half as the final splitting action.
PS> In this case there is no need for an array field since the iterator can be reused.

By the way, it is currently possible to create an IteratorSpliterator with an estimated size:

Spliterators.spliterator(iterator, estimatedSize, Spliterator.CONCURRENT);

Of course this is a misuse of the CONCURRENT flag, but it's not otherwise used by the Stream API, so currently it has no unwanted side-effects.

With best regards, Tagir Valeev.

PS> On Jul 10, 2015, at 8:01 PM, Tagir F. Valeev <amaembo@gmail.com> wrote:
Hi Tagir,

On Jul 14, 2015, at 4:43 AM, Tagir F. Valeev <amaembo@gmail.com> wrote:
Hello!
Thank you for the detailed answer.
PS> Thanks for looking at this. Judging by the results i am guessing
PS> your measurements were performed on a 4-core system.
Yes, quad-core, I mentioned it before.
PS> My initial inclination for these scenarios is it is likely better
PS> for the developer to split the execution in two parts. The first
PS> part, low N and low Q, is to collect sequentially into an array or
PS> list. The second part (total size is known), low N and high Q,
PS> would then use that array/list as the stream source and operate in parallel.
Probably it would be fine to add such recommendations to the Javadoc of the corresponding methods (for example, BufferedReader.lines).
I think it is more suitable for: http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
PS> That, in a sense, is almost what your spliterator does for n <
PS> initial batch size :-) but because the target threshold size is
PS> derived from the unknown root size (Long.MAX_VALUE) the
PS> sub-spliterators are not themselves further split, so the
PS> parallelism is better but still limited. It's tricky to adjust the
PS> target size dynamically, we could do it for the first split (e.g.
PS> unknown size -> two known size), but tracking further splits is
PS> harder and may not reliably perform (due to concurrent
PS> sub-splitting). I generally don't want to complicate the
PS> computation mechanism for poor splitting use-cases.
I see. That was actually also the thing I wanted to suggest. Still, even improving just the first split would be great; I'd vote for it.
I am still hesitating. It's not general enough: we would be optimizing for poor splitting, under fragile scenarios, for N elements under the batch size. (FWIW the general computation already alternates left/right forking to avoid such spliterators causing out-of-memory issues where the computation runs away creating tasks.) I also pondered exposing the batch size as an argument, but i don't think developers will generally know what to do with it, whereas i suspect a size estimate is much easier to understand (even if described with guarded language in terms of proportions, e.g. at least N). And of course this does not stop us from internally using an alternative spliterator implementation for BufferedReader.lines such as the one suggested by Marko, which i think could be improved if a size estimate is given, as this could also influence the batch size and common difference.
This way user implementations based on AbstractSpliterator may also benefit. My optimization cannot be applied to AbstractSpliterator, as we cannot control the tryAdvance method.
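Tagir's point can be illustrated with a hypothetical user spliterator (`TokenSpliterator` is an invented example, not JDK code): the subclass supplies tryAdvance, and AbstractSpliterator's trySplit batches elements by calling it, so the library cannot substitute an iterator-wrapping implementation like UnknownSizeSpliterator underneath it.

```java
import java.util.Scanner;
import java.util.Spliterators;
import java.util.function.Consumer;

// A typical user-defined spliterator of unknown size: the element source
// (here a Scanner) is accessed only through the user's tryAdvance, which
// the library does not control.
class TokenSpliterator extends Spliterators.AbstractSpliterator<String> {
    private final Scanner scanner;

    TokenSpliterator(Scanner scanner) {
        super(Long.MAX_VALUE, 0);  // size unknown
        this.scanner = scanner;
    }

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        if (!scanner.hasNext())
            return false;
        action.accept(scanner.next());
        return true;
    }
}
```

AbstractSpliterator's built-in trySplit can only pump this tryAdvance to fill a batch; it has no way to detect "the underlying iterator is exhausted" before the batch is handed off, which is exactly the hook the UnknownSizeSpliterator approach relies on.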
PS> Your approach also degenerates when N is just over the current batch size.
Yes, it becomes bad, but my measurements show that it's still no worse than the current implementation:
Yes, no worse.
Benchmark                    (parallel)  (size)  Mode  Cnt    Score    Error  Units
IteratorTest.base                 false    1028  avgt   30  659.308 ± 12.618  us/op
IteratorTest.base                 false    1100  avgt   30  746.619 ± 21.260  us/op
IteratorTest.base                  true    1028  avgt   30  194.800 ±  1.988  us/op
IteratorTest.base                  true    1100  avgt   30  228.926 ±  2.923  us/op
IteratorTest.jdkSpliterator       false    1028  avgt   30  648.622 ±  3.857  us/op
IteratorTest.jdkSpliterator       false    1100  avgt   30  750.741 ± 10.279  us/op
IteratorTest.jdkSpliterator        true    1028  avgt   30  673.574 ±  6.469  us/op
IteratorTest.jdkSpliterator        true    1100  avgt   30  679.499 ±  4.310  us/op
IteratorTest.optimized            false    1028  avgt   30  643.209 ±  1.686  us/op
IteratorTest.optimized            false    1100  avgt   30  718.077 ±  8.123  us/op
IteratorTest.optimized             true    1028  avgt   30  674.447 ±  5.153  us/op
IteratorTest.optimized             true    1100  avgt   30  674.622 ±  4.252  us/op
PS> It does not make as much sense to include such an estimate
PS> argument for Files.lines for two reasons:
PS> 1) I recently improved Files.lines for common character sets
PS> where it is efficient to identify encoded line feeds (such as UTF-8); and
PS> 2) for other character sets we could derive an estimate from the
PS> file size (it could actually be the file size, which is the case for 1).
I see. FileChannelLinesSpliterator is a nice improvement. Have you tested it on device files, named pipes, or cases where the file size is changed in between by some other process?
No. There is a general clause saying "all bets are off" if the file contents are modified during stream execution. The size is currently snapshotted at the time of the Files.lines call. If the file size increases in the interim i suspect it will still work; if it shrinks, i would expect an I/O exception.
Also, it should probably be mentioned in the BufferedReader.lines documentation that using Files.lines is preferred for a file source.
Good point, a "@see Files.lines ..." would be appropriate.
PS> So i am inclined to do the following:
PS> 1) consider adding a size estimate argument for all stream
PS> factories that are of unknown size and where no reasonable estimate can be derived.
PS> 2) modify the spliterator factories in Spliterators to accept an estimate e.g.:
PS> public static <T> Spliterator<T> spliteratorUnknownSize(Iterator<? extends T> iterator,
PS>                                                         long sizeEstimate,
PS>                                                         int characteristics)
By the way, it is currently possible to create an IteratorSpliterator with an estimated size:
Spliterators.spliterator(iterator, estimatedSize, Spliterator.CONCURRENT);
Of course this is a misuse of the CONCURRENT flag, but it's not otherwise used by the Stream API, so currently it has no unwanted side-effects.
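For illustration, the workaround looks like this. The wrapper class and method names (`EstimateTrick`, `estimatedParallelStream`) are hypothetical, and, as discussed in this thread, the trick is not something to encourage: CONCURRENT is claimed untruthfully just to smuggle in a size estimate.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class EstimateTrick {
    static <T> Stream<T> estimatedParallelStream(Iterator<T> iterator, long estimatedSize) {
        // Spliterators.spliterator(it, size, characteristics) treats the
        // given size as exact (adding SIZED | SUBSIZED) unless CONCURRENT
        // is set, in which case the size is kept only as an estimate.
        Spliterator<T> s =
            Spliterators.spliterator(iterator, estimatedSize, Spliterator.CONCURRENT);
        return StreamSupport.stream(s, true);
    }
}
```

The resulting spliterator reports estimatedSize from estimateSize() without reporting SIZED, which is exactly the "estimated size" behavior the proposed spliteratorUnknownSize overload would make official.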
Very sneaky :-) but not something we cannot encourage.

Paul.
On Jul 15, 2015, at 2:20 PM, Paul Sandoz <Paul.Sandoz@oracle.com> wrote:
By the way, it is currently possible to create an IteratorSpliterator with an estimated size:
Spliterators.spliterator(iterator, estimatedSize, Spliterator.CONCURRENT);
Of course this is a misuse of the CONCURRENT flag, but it's not otherwise used by the Stream API, so currently it has no unwanted side-effects.
Very sneaky :-) but not something we cannot encourage.
I meant to say: ... but something we cannot encourage. Paul.
participants (2)

- Paul Sandoz
- Tagir F. Valeev