experience trying out lambda-8-b74
Henry Jen
henry.jen at oracle.com
Sun Jan 27 15:06:20 PST 2013
There is an implementation almost ready; we need more test coverage on exceptional cases. Otherwise, the functionality seems complete and passes the tests we have.
http://cr.openjdk.java.net/~henryjen/lambda/nio.5/webrev/
With that, look into Files.walk(Path), which iterates lazily as the stream is consumed; that should meet your need.
parallel() would be built-in once we have a general strategy on how to spliterate a sequential stream.
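A minimal sketch of the lazy traversal described above, using the Files.walk shape that eventually shipped in Java 8 (an assumption here; the API in the b74 lambda build may differ):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class WalkDemo {
    public static void main(String[] args) throws IOException {
        // Build a tiny tree: root/a.txt and root/sub/b.txt
        Path root = Files.createTempDirectory("walk-demo");
        Path sub = Files.createDirectory(root.resolve("sub"));
        Files.createFile(root.resolve("a.txt"));
        Files.createFile(sub.resolve("b.txt"));

        // walk() produces entries lazily as the stream is consumed,
        // so a huge directory tree is never materialized up front
        try (Stream<Path> paths = Files.walk(root)) {
            long regularFiles = paths.filter(Files::isRegularFile).count();
            System.out.println(regularFiles); // prints 2
        }
    }
}
```

The try-with-resources block matters: the stream holds open directory handles until it is closed.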
Cheers,
Henry
On Jan 27, 2013, at 12:45 PM, Gregg Wonderly <gregg at wonderly.org> wrote:
> One of the problems I have is an application which can end up with tens of thousands of files in a single directory. I tried to get an iterator with callbacks, or opendir/readdir, put into the JDK back in 1.3.1/1.4, to no avail. I'd really like to have something as simple as opendir/readdir, provided as a stream that is efficient and not memory intensive. I currently have to use JNI to do opendir/readdir, and I need a count function that uses readdir or an appropriate mechanism to see when certain thresholds are passed.
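The incremental opendir/readdir-style listing asked for here is close to what NIO.2's DirectoryStream (available since Java 7) already provides; a small sketch:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class LazyListing {
    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lazy-listing");
        for (int i = 0; i < 5; i++) {
            Files.createFile(dir.resolve("file-" + i + ".txt"));
        }

        // DirectoryStream reads entries incrementally from the OS,
        // much like opendir/readdir, instead of building a full
        // array the way File.listFiles() does
        int count = 0;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                count++;
            }
        }
        System.out.println(count); // prints 5
    }
}
```

Counting past a threshold can short-circuit the loop, so only as many entries are read as needed.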
>
> Gregg Wonderly
>
> Sent from my iPad
>
> On Jan 27, 2013, at 12:47 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
>
>> Thanks, this is a nice little example.
>>
>> First, I'll note we have been brainstorming on stream support for NIO
>> directory streams, which might help in this situation. This should come
>> up for review here pretty soon, and it would be great if you could
>> validate whether it addresses your concerns here.
>>
>> The "recursive explosion" problem is an interesting one, and not one
>> we'd explicitly baked into the design requirements for
>> flatMap/mapMulti/explode. (Note that Scala's flatMap, or .NET's
>> SelectMany, do not explicitly handle this either.) But your solution is
>> pretty clean anyway! You can lower the overhead you are concerned about
>> by not creating a Stream but instead forEaching directly on the result
>> of listFiles:
>>
>> } else if (f.isDirectory()) {
>>     for (File fi : f.listFiles())
>>         fileExploder(ds, fi); /* recursive call! */
>> }
>>
>> That isn't as bad. You are still creating the array inside
>> listFiles(), but at least you create no intermediate objects to
>> iterate over it or explode it.
>>
>> BTW, the current (frequently confusing) design of explode is based on
>> the exact performance concern you are worried about; exploding an
>> element to nothing should result in no work. Creating an empty
>> collection, and then iterating it, would be a loss, which is why
>> something that is more like map(Function<T, Collection<U>>) was not
>> selected, despite being simpler to understand (and convenient in the
>> common but by-no-means universal case where you already have a
>> Collection lying around.)
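The trade-off described above can be sketched with the flatMap shape that eventually shipped in Java 8 (an assumption here; the b74 build under discussion uses explode and a Downstream instead). Returning a stream rather than a collection lets an element that maps to nothing contribute no allocation of a per-element collection:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class FlatMapShapes {
    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "", "bc");

        // With a Collection-returning mapper, the empty element would
        // still force building (and then iterating) an empty collection.
        // With a Stream-returning mapper, Stream.empty() does the
        // minimum possible work for an element that explodes to nothing.
        long letters = words.stream()
                .flatMap(w -> w.isEmpty()
                        ? Stream.<Character>empty()
                        : w.chars().mapToObj(c -> (char) c))
                .count();
        System.out.println(letters); // prints 3
    }
}
```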
>>
>> Whether parallelization will be a win or not is tricky, as you've
>> discovered. Effectively, your stream data is generated by explode(),
>> not by the stream source, and this process is fundamentally serial. So
>> you will get benefit out of parallelization only if the work done to
>> process the files (which could be as trivial as processing their names,
>> or as huge as parsing gigabyte XML documents) is big enough to make up
>> for the serial part. Unfortunately, the library can't figure that out
>> for you; only you know that.
>>
>> That said, it's interesting that you see different parallelization by
>> adding the intermediate collect() stage. This may well be a bug; we'll
>> look into it.
>>
>> On 1/27/2013 8:31 AM, Arne Siegel wrote:
>>> Hi lambda nation,
>>>
>>> was wondering if I could lambda-ize this little file system crawler:
>>>
>>> public void forAllFilesDo(Block<File> fileConsumer, File baseFile) {
>>>     if (baseFile.isHidden()) {
>>>         // do nothing
>>>     } else if (baseFile.isFile()) {
>>>         fileConsumer.accept(baseFile);
>>>     } else if (baseFile.isDirectory()) {
>>>         File[] files = baseFile.listFiles();
>>>         Arrays.stream(files).forEach(f -> forAllFilesDo(fileConsumer, f));
>>>     }
>>> }
>>>
>>> Using b74, the following reformulation produces identical behaviour:
>>>
>>> public void forAllFilesDo3(Block<File> fileConsumer, File baseFile) {
>>>     BiBlock<Stream.Downstream<File>, File> fileExploder = (ds, f) -> {
>>>         if (f.isHidden()) {
>>>             // do nothing
>>>         } else if (f.isFile()) {
>>>             ds.send(f);
>>>         } else if (f.isDirectory()) {
>>>             Arrays.stream(f.listFiles())
>>>                   .forEach(fi -> fileExploder(ds, fi)); /* recursive call! */
>>>         }
>>>     };
>>>     Arrays.stream(new File[] {baseFile})
>>>           .explode(fileExploder)
>>>           .forEach(fileConsumer::accept);
>>> }
>>>
>>> As my next step I was looking at whether I could parallelize the last step. But simply
>>> inserting .parallel() seems not to work - presumably because the initial one-element
>>> Spliterator is governing split behaviour.
>>>
>>> Inserting .collect(Collectors.<File>toList()).parallelStream() instead works fine:
>>>
>>> Arrays.stream(new File[] {baseFile})
>>>       .explode(fileExploder)
>>>       .collect(Collectors.<File>toList())
>>>       .parallelStream()
>>>       .forEach(fileConsumer::accept);
>>>
>>> Main observations:
>>>
>>> (*) Exploding a single element requires a lot of programming overhead:
>>> 1st step - creation of a one-element array or collection
>>> 2nd step - stream the result of the 1st step
>>> 3rd step - explode the result of the 2nd step
>>> Not sure if anything can be done about this.
>>>
>>> (*) .parallel() after .explode() behaves unexpectedly. Maybe the downstream should be based
>>> on a completely fresh unknown-size spliterator.
>>>
>>> Regards
>>> Arne Siegel
>>
>
More information about the lambda-dev mailing list