experience trying out lambda-8-b74
Brian Goetz
brian.goetz at oracle.com
Sun Jan 27 10:47:06 PST 2013
Thanks, this is a nice little example.
First, I'll note we have been brainstorming on stream support for NIO
directory streams, which might help in this situation. This should come
up for review here pretty soon, and it would be great if you could
validate whether it addresses your concerns here.
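For context, the stream support for directory traversal that eventually shipped exposes the file tree lazily as a Stream<Path> via Files.walk. A minimal sketch of the same crawler on that API (this uses the final java.nio.file API, not the b74 prototype under review here):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.stream.Stream;

public class WalkCrawler {
    // Visits every regular, non-hidden file under base, depth-first.
    static void forAllFilesDo(Consumer<Path> fileConsumer, Path base) throws IOException {
        try (Stream<Path> paths = Files.walk(base)) {   // lazy traversal; must be closed
            paths.filter(p -> {
                     try {
                         return Files.isRegularFile(p) && !Files.isHidden(p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 })
                 .forEach(fileConsumer);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small demo tree: a.txt at the top, b.txt one level down.
        Path dir = Files.createTempDirectory("walk-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(Files.createDirectory(dir.resolve("sub")).resolve("b.txt"));
        forAllFilesDo(p -> System.out.println(p.getFileName()), dir);
    }
}
```

Note that the traversal itself is still sequential underneath; the win is that no hand-written recursion or explode step is needed.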
The "recursive explosion" problem is an interesting one, and not one
we'd explicitly baked into the design requirements for
flatMap/mapMulti/explode. (Note that Scala's flatMap, or .NET's
SelectMany, do not explicitly handle this either.) But your solution is
pretty clean anyway! You can lower the overhead you are concerned about
by not creating a Stream but instead forEaching directly on the result
of listFiles:
} else if (f.isDirectory()) {
for (File fi : f.listFiles())
fileExploder.accept(ds, fi); /* recursive call ! */
}
Which isn't as bad. You are still creating the array inside
listFiles(), but at least you create no intermediate stream objects
just to iterate over and explode that array.
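Folded into a complete crawler, and translated to the types that final Java 8 settled on (Consumer in place of Block; a sketch, not the b74 code), the loop version reads:

```java
import java.io.File;
import java.util.function.Consumer;

public class DirectRecursionCrawler {
    // Plain recursion: no per-directory stream pipeline is created,
    // only the File[] that listFiles() allocates internally.
    static void forAllFilesDo(Consumer<File> fileConsumer, File f) {
        if (f.isHidden()) {
            // skip hidden entries
        } else if (f.isFile()) {
            fileConsumer.accept(f);
        } else if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) {               // listFiles() returns null on I/O error
                for (File child : children)
                    forAllFilesDo(fileConsumer, child);  // recursive call
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File dir = java.nio.file.Files.createTempDirectory("crawl-demo").toFile();
        new File(dir, "a.txt").createNewFile();
        File sub = new File(dir, "sub");
        sub.mkdir();
        new File(sub, "b.txt").createNewFile();
        forAllFilesDo(f -> System.out.println(f.getName()), dir);
    }
}
```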
BTW, the current (frequently confusing) design of explode is based on
the exact performance concern you are worried about; exploding an
element to nothing should result in no work. Creating an empty
collection, and then iterating it, would be a loss, which is why
something that is more like map(Function<T, Collection<U>>) was not
something that is more like map(Function<T, Collection<U>>) was not
selected, despite being simpler to understand (and convenient in the
common, but by no means universal, case where you already have a
Collection lying around).
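The trade-off shows up in the operation that final Java 8 eventually shipped as flatMap, which takes a Function<T, Stream<U>>: an element that expands to nothing maps to Stream.empty(), so no collection is built and then iterated per element. A sketch contrasting the two shapes (final API, used here as an illustration; b74 still calls the operation explode):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapShapes {
    public static void main(String[] args) {
        // Collection-returning shape: even "no result" allocates a List
        // that the pipeline must then iterate.
        List<Integer> viaCollections = Stream.of(1, 2, 3, 4)
                .map(n -> n % 2 == 0 ? Arrays.asList(n, n)
                                     : Collections.<Integer>emptyList())
                .flatMap(List::stream)
                .collect(Collectors.toList());

        // Stream-returning shape: odd elements contribute Stream.empty(),
        // so no per-element collection is built and walked.
        List<Integer> viaStreams = Stream.of(1, 2, 3, 4)
                .flatMap(n -> n % 2 == 0 ? Stream.of(n, n)
                                         : Stream.<Integer>empty())
                .collect(Collectors.toList());

        System.out.println(viaCollections);
        System.out.println(viaStreams);
    }
}
```

Both pipelines produce the same elements; the difference is purely in what gets allocated along the way.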
Whether parallelization will be a win or not is tricky, as you've
discovered. Effectively, your stream data is generated by explode(),
not by the stream source, and this process is fundamentally serial. So
you will get a benefit out of parallelization only if the work done to
process the files (which could be as trivial as processing their names,
or as huge as parsing gigabyte XML documents) is big enough to make up
for the serial part. Unfortunately, the library can't figure that out
for you; only you know that.
That said, it's interesting that you see different parallelization when
adding the intermediate collect() stage. This may well be a bug; we'll
look into it.
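The collect-then-parallelStream pattern generalizes: when the source is effectively serial, materialize the elements first and parallelize only the per-element work, which pays off only when that work dominates the serial discovery phase. A sketch in final Java 8 terms (names and workload here are purely illustrative):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CollectThenParallel {
    // Stand-in for nontrivial per-file work (e.g. parsing a document).
    static long process(int id) {
        long h = id;
        for (int i = 0; i < 10_000; i++) h = h * 31 + i;
        return h;
    }

    public static void main(String[] args) {
        // Serial "discovery" phase, analogous to walking a directory tree.
        List<Integer> ids = IntStream.range(0, 1_000)
                                     .boxed()
                                     .collect(Collectors.toList());

        // Parallel "processing" phase over the materialized list, which
        // splits well because the list knows its size.
        AtomicLong processed = new AtomicLong();
        ids.parallelStream().forEach(id -> {
            process(id);
            processed.incrementAndGet();
        });

        System.out.println(processed.get());
    }
}
```

If process() were as cheap as reading a file name, the collect step and the fork/join overhead would likely eat any gain.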
On 1/27/2013 8:31 AM, Arne Siegel wrote:
> Hi lambda nation,
>
> was wondering if I could lambda-ize this little file system crawler:
>
> public void forAllFilesDo(Block<File> fileConsumer, File baseFile) {
> if (baseFile.isHidden()) {
> // do nothing
> } else if (baseFile.isFile()) {
> fileConsumer.accept(baseFile);
> } else if (baseFile.isDirectory()) {
> File[] files = baseFile.listFiles();
> Arrays.stream(files).forEach(f -> forAllFilesDo(fileConsumer, f));
> }
> }
>
> Using b74, the following reformulation produces the identical behaviour:
>
> public void forAllFilesDo3(Block<File> fileConsumer, File baseFile) {
> BiBlock<Stream.Downstream<File>,File> fileExploder = (ds, f) -> {
> if (f.isHidden()) {
> // do nothing
> } else if (f.isFile()) {
> ds.send(f);
> } else if (f.isDirectory()) {
> Arrays.stream(f.listFiles())
> .forEach(fi -> fileExploder.accept(ds, fi)); /* recursive call ! */
> }
> };
> Arrays.stream(new File[] {baseFile})
> .explode(fileExploder)
> .forEach(fileConsumer::accept);
> }
>
> As my next step I was looking if I could parallelize the last step. But simply inserting .parallel()
> seems not to be working - presumably because the initial one-element Spliterator is
> governing split behaviour.
>
> Inserting .collect(Collectors.<File>toList()).parallelStream() instead works fine:
>
> Arrays.stream(new File[] {baseFile})
> .explode(fileExploder)
> .collect(Collectors.<File>toList())
> .parallelStream()
> .forEach(fileConsumer::accept);
>
> Main observations:
>
> (*) Exploding a single element requires a lot of programming overhead:
> 1st step - creation of a one-element array or collection
> 2nd step - stream the result of the 1st step
> 3rd step - explode the result of the 2nd step
> Not sure if anything can be done about this.
>
> (*) .parallel() after .explode() behaves unexpectedly. Maybe the downstream should be based
> on a completely fresh unknown-size spliterator.
>
> Regards
> Arne Siegel
>
>
More information about the lambda-dev
mailing list