experience trying out lambda-8-b74
Brian Goetz
brian.goetz at oracle.com
Sun Jan 27 10:47:06 PST 2013
Thanks, this is a nice little example.
First, I'll note we have been brainstorming on stream support for NIO
directory streams, which might help in this situation. This should come
up for review here pretty soon, and it would be great if you could
validate whether it addresses your concerns here.
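For context, the stream support for directory traversal that eventually shipped exposes the file tree lazily as a Stream<Path> via Files.walk. A minimal sketch of the same crawler on that API (this uses the final java.nio.file API, not the b74 prototype under review here):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Consumer;
import java.util.stream.Stream;

public class WalkCrawler {
    // Visits every regular, non-hidden file under base, depth-first.
    static void forAllFilesDo(Consumer<Path> fileConsumer, Path base) throws IOException {
        try (Stream<Path> paths = Files.walk(base)) {   // lazy traversal; must be closed
            paths.filter(p -> {
                     try {
                         return Files.isRegularFile(p) && !Files.isHidden(p);
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 })
                 .forEach(fileConsumer);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small demo tree: a.txt at the top, b.txt one level down.
        Path dir = Files.createTempDirectory("walk-demo");
        Files.createFile(dir.resolve("a.txt"));
        Files.createFile(Files.createDirectory(dir.resolve("sub")).resolve("b.txt"));
        forAllFilesDo(p -> System.out.println(p.getFileName()), dir);
    }
}
```

Note that the traversal itself is still sequential underneath; the win is that no hand-written recursion or explode step is needed.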
The "recursive explosion" problem is an interesting one, and not one
we'd explicitly baked into the design requirements for
flatMap/mapMulti/explode. (Note that Scala's flatMap, or .NET's
SelectMany, do not explicitly handle this either.) But your solution is
pretty clean anyway! You can lower the overhead you are concerned about
by not creating a Stream but instead forEaching directly on the result
of listFiles:
} else if (f.isDirectory()) {
for (File fi : f.listFiles())
fileExploder.accept(ds, fi); /* recursive call ! */
}
Which isn't as bad. You are still creating the array inside
listFiles(), but at least you create no intermediate stream objects
just to iterate over and explode that array.
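Folded into a complete crawler, and translated to the types that final Java 8 settled on (Consumer in place of Block; a sketch, not the b74 code), the loop version reads:

```java
import java.io.File;
import java.util.function.Consumer;

public class DirectRecursionCrawler {
    // Plain recursion: no per-directory stream pipeline is created,
    // only the File[] that listFiles() allocates internally.
    static void forAllFilesDo(Consumer<File> fileConsumer, File f) {
        if (f.isHidden()) {
            // skip hidden entries
        } else if (f.isFile()) {
            fileConsumer.accept(f);
        } else if (f.isDirectory()) {
            File[] children = f.listFiles();
            if (children != null) {               // listFiles() returns null on I/O error
                for (File child : children)
                    forAllFilesDo(fileConsumer, child);  // recursive call
            }
        }
    }

    public static void main(String[] args) throws Exception {
        File dir = java.nio.file.Files.createTempDirectory("crawl-demo").toFile();
        new File(dir, "a.txt").createNewFile();
        File sub = new File(dir, "sub");
        sub.mkdir();
        new File(sub, "b.txt").createNewFile();
        forAllFilesDo(f -> System.out.println(f.getName()), dir);
    }
}
```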
BTW, the current (frequently confusing) design of explode is based on
the exact performance concern you are worried about; exploding an
element to nothing should result in no work. Creating an empty
collection, and then iterating it, would be a loss, which is why
something that is more like map(Function<T, Collection<U>>) was not
something that is more like map(Function<T, Collection<U>>) was not
selected, despite being simpler to understand (and convenient in the
common, but by no means universal, case where you already have a
Collection lying around).
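The trade-off shows up in the operation that final Java 8 eventually shipped as flatMap, which takes a Function<T, Stream<U>>: an element that expands to nothing maps to Stream.empty(), so no collection is built and then iterated per element. A sketch contrasting the two shapes (final API, used here as an illustration; b74 still calls the operation explode):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FlatMapShapes {
    public static void main(String[] args) {
        // Collection-returning shape: even "no result" allocates a List
        // that the pipeline must then iterate.
        List<Integer> viaCollections = Stream.of(1, 2, 3, 4)
                .map(n -> n % 2 == 0 ? Arrays.asList(n, n)
                                     : Collections.<Integer>emptyList())
                .flatMap(List::stream)
                .collect(Collectors.toList());

        // Stream-returning shape: odd elements contribute Stream.empty(),
        // so no per-element collection is built and walked.
        List<Integer> viaStreams = Stream.of(1, 2, 3, 4)
                .flatMap(n -> n % 2 == 0 ? Stream.of(n, n)
                                         : Stream.<Integer>empty())
                .collect(Collectors.toList());

        System.out.println(viaCollections);
        System.out.println(viaStreams);
    }
}
```

Both pipelines produce the same elements; the difference is purely in what gets allocated along the way.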
Whether parallelization will be a win or not is tricky, as you've
discovered. Effectively, your stream data is generated by explode(),
not by the stream source, and this process is fundamentally serial. So
you will get a benefit out of parallelization only if the work done to
process the files (which could be as trivial as processing their names,
or as huge as parsing gigabyte XML documents) is big enough to make up
for the serial part. Unfortunately, the library can't figure that out
for you; only you know that.
That said, it's interesting that you see different parallelization when
adding the intermediate collect() stage. This may well be a bug; we'll
look into it.
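The collect-then-parallelStream pattern generalizes: when the source is effectively serial, materialize the elements first and parallelize only the per-element work, which pays off only when that work dominates the serial discovery phase. A sketch in final Java 8 terms (names and workload here are purely illustrative):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CollectThenParallel {
    // Stand-in for nontrivial per-file work (e.g. parsing a document).
    static long process(int id) {
        long h = id;
        for (int i = 0; i < 10_000; i++) h = h * 31 + i;
        return h;
    }

    public static void main(String[] args) {
        // Serial "discovery" phase, analogous to walking a directory tree.
        List<Integer> ids = IntStream.range(0, 1_000)
                                     .boxed()
                                     .collect(Collectors.toList());

        // Parallel "processing" phase over the materialized list, which
        // splits well because the list knows its size.
        AtomicLong processed = new AtomicLong();
        ids.parallelStream().forEach(id -> {
            process(id);
            processed.incrementAndGet();
        });

        System.out.println(processed.get());
    }
}
```

If process() were as cheap as reading a file name, the collect step and the fork/join overhead would likely eat any gain.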
On 1/27/2013 8:31 AM, Arne Siegel wrote:
> Hi lambda nation,
>
> was wondering if I could lambda-ize this little file system crawler:
>
> public void forAllFilesDo(Block<File> fileConsumer, File baseFile) {
> if (baseFile.isHidden()) {
> // do nothing
> } else if (baseFile.isFile()) {
> fileConsumer.accept(baseFile);
> } else if (baseFile.isDirectory()) {
> File[] files = baseFile.listFiles();
> Arrays.stream(files).forEach(f -> forAllFilesDo(fileConsumer, f));
> }
> }
>
> Using b74, the following reformulation produces the identical behaviour:
>
> public void forAllFilesDo3(Block<File> fileConsumer, File baseFile) {
> BiBlock<Stream.Downstream<File>,File> fileExploder = (ds, f) -> {
> if (f.isHidden()) {
> // do nothing
> } else if (f.isFile()) {
> ds.send(f);
> } else if (f.isDirectory()) {
> Arrays.stream(f.listFiles())
> .forEach(fi -> fileExploder.accept(ds, fi)); /* recursive call ! */
> }
> };
> Arrays.stream(new File[] {baseFile})
> .explode(fileExploder)
> .forEach(fileConsumer::accept);
> }
>
> As my next step I was looking if I could parallelize the last step. But simply inserting .parallel()
> seems not to be working - presumably because the initial one-element Spliterator is
> governing split behaviour.
>
> Inserting .collect(Collectors.<File>toList()).parallelStream() instead works fine:
>
> Arrays.stream(new File[] {baseFile})
> .explode(fileExploder)
> .collect(Collectors.<File>toList())
> .parallelStream()
> .forEach(fileConsumer::accept);
>
> Main observations:
>
> (*) Exploding a single element requires a lot of programming overhead:
> 1st step - creation of a one-element array or collection
> 2nd step - stream the result of the 1st step
> 3rd step - explode the result of the 2nd step
> Not sure if anything can be done about this.
>
> (*) .parallel() after .explode() behaves unexpectedly. Maybe the downstream should be based
> on a completely fresh unknown-size spliterator.
>
> Regards
> Arne Siegel
>
>
More information about the lambda-dev
mailing list