experience trying out lambda-8-b74
Gregg Wonderly
gregg at wonderly.org
Sun Jan 27 12:45:28 PST 2013
One of the problems I have is an application which can end up with tens of thousands of files in a single directory. I tried to get an iterator with callbacks, or opendir/readdir, put into the JDK back in 1.3.1/1.4, to no avail. I'd really like to have something as simple as opendir/readdir, provided as a stream that is efficient and not memory intensive. I currently have to use JNI to do opendir/readdir, and I need a count function that uses readdir, or an appropriate mechanism, to see when certain thresholds are passed.
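NIO.2's DirectoryStream (JDK 7) is the closest thing the platform has today, since it hands entries back lazily; here is a minimal sketch of the kind of threshold count I mean (the class name and the 10,000 cutoff are just placeholders):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ThresholdCount {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);
        long count = 0;
        // Entries are delivered one at a time, like opendir/readdir;
        // the full listing is never materialized in memory.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (++count >= 10000) {    // example threshold
                    System.out.println("threshold passed at " + entry);
                    break;
                }
            }
        }
        System.out.println(count + " entries seen");
    }
}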
Gregg Wonderly
Sent from my iPad
On Jan 27, 2013, at 12:47 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> Thanks, this is a nice little example.
>
> First, I'll note we have been brainstorming on stream support for NIO
> directory streams, which might help in this situation. This should come
> up for review here pretty soon, and it would be great if you could
> validate whether it addresses your concerns here.
>
> The "recursive explosion" problem is an interesting one, and not one
> we'd explicitly baked into the design requirements for
> flatMap/mapMulti/explode. (Note that neither Scala's flatMap nor
> .NET's SelectMany explicitly handles this.) But your solution is
> pretty clean anyway! You can lower the overhead you are concerned about
> by not creating a Stream but instead forEaching directly on the result
> of listFiles:
>
> } else if (f.isDirectory()) {
>     for (File fi : f.listFiles())
>         fileExploder.accept(ds, fi); /* recursive call ! */
> }
>
> That isn't as bad: you are still creating the array inside
> listFiles(), but at least you create no intermediate objects just to
> iterate over it and explode it.
>
> BTW, the current (frequently confusing) design of explode is based on
> the exact performance concern you are worried about; exploding an
> element to nothing should result in no work. Creating an empty
> collection, and then iterating it, would be a loss, which is why
> something that is more like map(Function<T, Collection<U>>) was not
> selected, despite being simpler to understand (and convenient in the
> common but by-no-means universal case where you already have a
> Collection lying around).
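>
> To make the contrast concrete, here is a sketch of the two shapes,
> reusing the Downstream/BiBlock types from your code below (the
> collection-returning variant is hypothetical; Function is spelled as
> in the eventual java.util.function package):
>
> // Sink-based shape: an element that explodes to nothing sends
> // nothing downstream, so it costs nothing.
> BiBlock<Stream.Downstream<File>, File> sinkShape =
>     (ds, f) -> { if (f.isFile()) ds.send(f); };
>
> // Collection-returning shape (rejected): even an element that
> // produces nothing allocates an empty Collection, which the
> // library must then create and iterate.
> Function<File, Collection<File>> collectionShape =
>     f -> f.isFile() ? Collections.singletonList(f)
>                     : Collections.<File>emptyList();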
>
> Whether parallelization will be a win is tricky to predict, as you've
> discovered. Effectively, your stream data is generated by explode(),
> not by the stream source, and that process is fundamentally serial. So
> you will benefit from parallelization only if the work done to
> process the files (which could be as trivial as processing their names,
> or as heavy as parsing gigabyte XML documents) is big enough to make up
> for the serial part. Unfortunately, the library can't figure that out
> for you; only you know that.
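>
> One rough way to gauge it (a hypothetical harness; process() and base
> stand in for your real per-file work and root directory) is to time
> the traversal alone against traversal plus work:
>
> long t0 = System.nanoTime();
> forAllFilesDo(f -> { }, base);             // traversal only
> long traversal = System.nanoTime() - t0;
>
> t0 = System.nanoTime();
> forAllFilesDo(f -> process(f), base);      // traversal + per-file work
> long total = System.nanoTime() - t0;
>
> // Parallelism can at best hide (total - traversal); if that is small
> // relative to traversal, .parallel() cannot pay for itself.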
>
> That said, it's interesting that you see different parallelization
> behaviour after adding the intermediate collect() stage. This may well
> be a bug; we'll look into it.
>
> On 1/27/2013 8:31 AM, Arne Siegel wrote:
>> Hi lambda nation,
>>
>> was wondering if I could lambda-ize this little file system crawler:
>>
>> public void forAllFilesDo(Block<File> fileConsumer, File baseFile) {
>>     if (baseFile.isHidden()) {
>>         // do nothing
>>     } else if (baseFile.isFile()) {
>>         fileConsumer.accept(baseFile);
>>     } else if (baseFile.isDirectory()) {
>>         File[] files = baseFile.listFiles();
>>         Arrays.stream(files).forEach(f -> forAllFilesDo(fileConsumer, f));
>>     }
>> }
>>
>> Using b74, the following reformulation produces the identical behaviour:
>>
>> public void forAllFilesDo3(Block<File> fileConsumer, File baseFile) {
>>     BiBlock<Stream.Downstream<File>, File> fileExploder = (ds, f) -> {
>>         if (f.isHidden()) {
>>             // do nothing
>>         } else if (f.isFile()) {
>>             ds.send(f);
>>         } else if (f.isDirectory()) {
>>             Arrays.stream(f.listFiles())
>>                   .forEach(fi -> fileExploder.accept(ds, fi)); /* recursive call ! */
>>         }
>>     };
>>     Arrays.stream(new File[] {baseFile})
>>           .explode(fileExploder)
>>           .forEach(fileConsumer::accept);
>> }
>>
>> As my next step I looked at whether I could parallelize the last step. But
>> simply inserting .parallel() does not seem to work - presumably because the
>> initial one-element Spliterator is governing split behaviour.
>>
>> Inserting .collect(Collectors.<File>toList()).parallelStream() instead works fine:
>>
>> Arrays.stream(new File[] {baseFile})
>>       .explode(fileExploder)
>>       .collect(Collectors.<File>toList())
>>       .parallelStream()
>>       .forEach(fileConsumer::accept);
>>
>> Main observations:
>>
>> (*) Exploding a single element requires a lot of programming overhead:
>> 1st step - creation of a one-element array or collection
>> 2nd step - stream the result of the 1st step
>> 3rd step - explode the result of the 2nd step
>> Not sure if anything can be done about this (see the sketch after these
>> observations).
>>
>> (*) .parallel() after .explode() behaves unexpectedly. Maybe the downstream should be based
>> on a completely fresh unknown-size spliterator.
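>>
>> To illustrate the first point: a single-element stream factory, say a
>> hypothetical Stream.of(baseFile), would collapse the three steps into
>> one:
>>
>> Stream.of(baseFile)
>>       .explode(fileExploder)
>>       .forEach(fileConsumer::accept);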
>>
>> Regards
>> Arne Siegel
>