experience trying out lambda-8-b74
Henry Jen
henry.jen at oracle.com
Sun Jan 27 15:06:20 PST 2013
There is an implementation almost ready; we need more test coverage on exceptional cases. Otherwise, the functionality seems complete and passes the tests we have.
http://cr.openjdk.java.net/~henryjen/lambda/nio.5/webrev/
With that, look into Files.walk(Path), which iterates lazily as the stream is consumed; that should meet your need.
parallel() would be built-in once we have a general strategy on how to spliterate a sequential stream.
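A minimal sketch of the lazy traversal described above, using the Files.walk shape that eventually shipped in Java 8 (an assumption here; the API in the b74 lambda build may differ):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class WalkDemo {
    public static void main(String[] args) throws IOException {
        // Build a tiny tree: root/a.txt and root/sub/b.txt
        Path root = Files.createTempDirectory("walk-demo");
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(root.resolve("a.txt"));
        Files.createFile(sub.resolve("b.txt"));

        // walk() produces entries lazily as the stream is consumed,
        // so a huge directory tree is never materialized up front
        try (Stream<Path> paths = Files.walk(root)) {
            long regularFiles = paths.filter(Files::isRegularFile).count();
            System.out.println(regularFiles); // prints 2
        }
    }
}
```

The try-with-resources block matters: the stream holds open directory handles until it is closed.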
Cheers,
Henry
On Jan 27, 2013, at 12:45 PM, Gregg Wonderly <gregg at wonderly.org> wrote:
> One of the problems I have is an application which can end up with tens of thousands of files in a single directory. I tried to get an iterator with callbacks, or opendir/readdir, put into the JDK back in 1.3.1/1.4, to no avail. I'd really like to have something as simple as opendir/readdir, provided as a stream that is efficient and not memory intensive. I currently have to use JNI to do opendir/readdir, and I need a count function that uses readdir or an appropriate mechanism to see when certain thresholds are passed.
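The incremental opendir/readdir-style listing asked for here is close to what NIO.2's DirectoryStream (available since Java 7) already provides; a small sketch:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LazyListing {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-listing");
        for (int i = 0; i < 5; i++) {
            Files.createFile(dir.resolve("file-" + i + ".txt"));
        }

        // DirectoryStream reads entries incrementally from the OS,
        // much like opendir/readdir, instead of building a full
        // array the way File.listFiles() does
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                count++;
            }
        }
        System.out.println(count); // prints 5
    }
}
```

Counting past a threshold can short-circuit the loop, so only as many entries are read as needed.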
>
> Gregg Wonderly
>
> Sent from my iPad
>
> On Jan 27, 2013, at 12:47 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
>> Thanks, this is a nice little example.
>>
>> First, I'll note we have been brainstorming on stream support for NIO
>> directory streams, which might help in this situation. This should come
>> up for review here pretty soon, and it would be great if you could
>> validate whether it addresses your concerns here.
>>
>> The "recursive explosion" problem is an interesting one, and not one
>> we'd explicitly baked into the design requirements for
>> flatMap/mapMulti/explode. (Note that Scala's flatMap, or .NET's
>> SelectMany, do not explicitly handle this either.) But your solution is
>> pretty clean anyway! You can lower the overhead you are concerned about
>> by not creating a Stream but instead forEaching directly on the result
>> of listFiles:
>>
>> } else if (f.isDirectory()) {
>>     for (File fi : f.listFiles())
>>         fileExploder(ds, fi); /* recursive call! */
>> }
>>
>> That isn't as bad. You are still creating the array inside
>> listFiles(), but at least you create no intermediate objects to
>> iterate over it or explode it.
>>
>> BTW, the current (frequently confusing) design of explode is based on
>> the exact performance concern you are worried about; exploding an
>> element to nothing should result in no work. Creating an empty
>> collection, and then iterating it, would be a loss, which is why
>> something that is more like map(Function<T, Collection<U>>) was not
>> selected, despite being simpler to understand (and convenient in the
>> common but by-no-means universal case where you already have a
>> Collection lying around.)
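The trade-off described above can be sketched with the flatMap shape that eventually shipped in Java 8 (an assumption here; the b74 build under discussion uses explode and a Downstream instead). Returning a stream rather than a collection lets an element that maps to nothing contribute no allocation of a per-element collection:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class FlatMapShapes {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "", "bc");

        // With a Collection-returning mapper, the empty element would
        // still force building (and then iterating) an empty collection.
        // With a Stream-returning mapper, Stream.empty() does the
        // minimum possible work for an element that explodes to nothing.
        long letters = words.stream()
                .flatMap(w -> w.isEmpty()
                        ? Stream.<Character>empty()
                        : w.chars().mapToObj(c -> (char) c))
                .count();
        System.out.println(letters); // prints 3
    }
}
```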
>>
>> Whether parallelization will be a win or not is tricky, as you've
>> discovered. Effectively, your stream data is generated by explode(),
>> not by the stream source, and this process is fundamentally serial. So
>> you will get benefit out of parallelization only if the work done to
>> process the files (which could be as trivial as processing their names,
>> or as huge as parsing gigabyte XML documents) is big enough to make up
>> for the serial part. Unfortunately, the library can't figure that out
>> for you; only you know that.
>>
>> That said, it's interesting that you see different parallelization by
>> adding the intermediate collect() stage. This may well be a bug; we'll
>> look into it.
>>
>> On 1/27/2013 8:31 AM, Arne Siegel wrote:
>>> Hi lambda nation,
>>>
>>> was wondering if I could lambda-ize this little file system crawler:
>>>
>>> public void forAllFilesDo(Block<File> fileConsumer, File baseFile) {
>>>     if (baseFile.isHidden()) {
>>>         // do nothing
>>>     } else if (baseFile.isFile()) {
>>>         fileConsumer.accept(baseFile);
>>>     } else if (baseFile.isDirectory()) {
>>>         File[] files = baseFile.listFiles();
>>>         Arrays.stream(files).forEach(f -> forAllFilesDo(fileConsumer, f));
>>>     }
>>> }
>>>
>>> Using b74, the following reformulation produces identical behaviour:
>>>
>>> public void forAllFilesDo3(Block<File> fileConsumer, File baseFile) {
>>>     BiBlock<Stream.Downstream<File>, File> fileExploder = (ds, f) -> {
>>>         if (f.isHidden()) {
>>>             // do nothing
>>>         } else if (f.isFile()) {
>>>             ds.send(f);
>>>         } else if (f.isDirectory()) {
>>>             Arrays.stream(f.listFiles())
>>>                   .forEach(fi -> fileExploder(ds, fi)); /* recursive call! */
>>>         }
>>>     };
>>>     Arrays.stream(new File[] {baseFile})
>>>           .explode(fileExploder)
>>>           .forEach(fileConsumer::accept);
>>> }
>>>
>>> As my next step I was looking at whether I could parallelize the last step. But simply
>>> inserting .parallel() seems not to work - presumably because the initial one-element
>>> Spliterator is governing split behaviour.
>>>
>>> Inserting .collect(Collectors.<File>toList()).parallelStream() instead works fine:
>>>
>>> Arrays.stream(new File[] {baseFile})
>>>       .explode(fileExploder)
>>>       .collect(Collectors.<File>toList())
>>>       .parallelStream()
>>>       .forEach(fileConsumer::accept);
>>>
>>> Main observations:
>>>
>>> (*) Exploding a single element requires a lot of programming overhead:
>>> 1st step - creation of a one-element array or collection
>>> 2nd step - stream the result of the 1st step
>>> 3rd step - explode the result of the 2nd step
>>> Not sure if anything can be done about this.
>>>
>>> (*) .parallel() after .explode() behaves unexpectedly. Maybe the downstream should be based
>>> on a completely fresh unknown-size spliterator.
>>>
>>> Regards
>>> Arne Siegel
>>
>
More information about the lambda-dev mailing list