experience trying out lambda-8-b74
Gregg Wonderly
gregg at wonderly.org
Sun Jan 27 12:45:28 PST 2013
One of the problems I have is an application which can end up with tens of thousands of files in a single directory. I tried to get an iterator with callbacks, or opendir/readdir, put into the JDK back in 1.3.1/1.4, to no avail. I'd really like to have something as simple as opendir/readdir, provided as a stream that is efficient and not memory intensive. I currently have to use JNI to do opendir/readdir, and I need a count function that uses readdir, or an appropriate mechanism, to see when certain thresholds are passed.
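NIO.2's DirectoryStream (JDK 7) is the closest thing the platform has today, since it hands entries back lazily; here is a minimal sketch of the kind of threshold count I mean (the class name and the 10,000 cutoff are just placeholders):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ThresholdCount {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);
        long count = 0;
        // Entries are delivered one at a time, like opendir/readdir;
        // the full listing is never materialized in memory.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path entry : entries) {
                if (++count >= 10000) {    // example threshold
                    System.out.println("threshold passed at " + entry);
                    break;
                }
            }
        }
        System.out.println(count + " entries seen");
    }
}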
Gregg Wonderly
Sent from my iPad
On Jan 27, 2013, at 12:47 PM, Brian Goetz <brian.goetz at oracle.com> wrote:
> Thanks, this is a nice little example.
>
> First, I'll note we have been brainstorming on stream support for NIO
> directory streams, which might help in this situation. This should come
> up for review here pretty soon, and it would be great if you could
> validate whether it addresses your concerns here.
>
> The "recursive explosion" problem is an interesting one, and not one
> we'd explicitly baked into the design requirements for
> flatMap/mapMulti/explode. (Note that neither Scala's flatMap nor
> .NET's SelectMany explicitly handles this.) But your solution is
> pretty clean anyway! You can lower the overhead you are concerned about
> by not creating a Stream but instead forEaching directly on the result
> of listFiles:
>
> } else if (f.isDirectory()) {
>     for (File fi : f.listFiles())
>         fileExploder.accept(ds, fi); /* recursive call ! */
> }
>
> That isn't as bad: you are still creating the array inside
> listFiles(), but at least you create no intermediate objects just to
> iterate over it and explode it.
>
> BTW, the current (frequently confusing) design of explode is based on
> the exact performance concern you are worried about; exploding an
> element to nothing should result in no work. Creating an empty
> collection, and then iterating it, would be a loss, which is why
> something that is more like map(Function<T, Collection<U>>) was not
> selected, despite being simpler to understand (and convenient in the
> common but by-no-means universal case where you already have a
> Collection lying around).
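>
> To make the contrast concrete, here is a sketch of the two shapes,
> reusing the Downstream/BiBlock types from your code below (the
> collection-returning variant is hypothetical; Function is spelled as
> in the eventual java.util.function package):
>
> // Sink-based shape: an element that explodes to nothing sends
> // nothing downstream, so it costs nothing.
> BiBlock<Stream.Downstream<File>, File> sinkShape =
>     (ds, f) -> { if (f.isFile()) ds.send(f); };
>
> // Collection-returning shape (rejected): even an element that
> // produces nothing allocates an empty Collection, which the
> // library must then create and iterate.
> Function<File, Collection<File>> collectionShape =
>     f -> f.isFile() ? Collections.singletonList(f)
>                     : Collections.<File>emptyList();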
>
> Whether parallelization will be a win is tricky to predict, as you've
> discovered. Effectively, your stream data is generated by explode(),
> not by the stream source, and that process is fundamentally serial. So
> you will benefit from parallelization only if the work done to
> process the files (which could be as trivial as processing their names,
> or as heavy as parsing gigabyte XML documents) is big enough to make up
> for the serial part. Unfortunately, the library can't figure that out
> for you; only you know that.
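>
> One rough way to gauge it (a hypothetical harness; process() and base
> stand in for your real per-file work and root directory) is to time
> the traversal alone against traversal plus work:
>
> long t0 = System.nanoTime();
> forAllFilesDo(f -> { }, base);             // traversal only
> long traversal = System.nanoTime() - t0;
>
> t0 = System.nanoTime();
> forAllFilesDo(f -> process(f), base);      // traversal + per-file work
> long total = System.nanoTime() - t0;
>
> // Parallelism can at best hide (total - traversal); if that is small
> // relative to traversal, .parallel() cannot pay for itself.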
>
> That said, it's interesting that you see different parallelization
> behaviour after adding the intermediate collect() stage. This may well
> be a bug; we'll look into it.
>
> On 1/27/2013 8:31 AM, Arne Siegel wrote:
>> Hi lambda nation,
>>
>> was wondering if I could lambda-ize this little file system crawler:
>>
>> public void forAllFilesDo(Block<File> fileConsumer, File baseFile) {
>>     if (baseFile.isHidden()) {
>>         // do nothing
>>     } else if (baseFile.isFile()) {
>>         fileConsumer.accept(baseFile);
>>     } else if (baseFile.isDirectory()) {
>>         File[] files = baseFile.listFiles();
>>         Arrays.stream(files).forEach(f -> forAllFilesDo(fileConsumer, f));
>>     }
>> }
>>
>> Using b74, the following reformulation produces the identical behaviour:
>>
>> public void forAllFilesDo3(Block<File> fileConsumer, File baseFile) {
>>     BiBlock<Stream.Downstream<File>, File> fileExploder = (ds, f) -> {
>>         if (f.isHidden()) {
>>             // do nothing
>>         } else if (f.isFile()) {
>>             ds.send(f);
>>         } else if (f.isDirectory()) {
>>             Arrays.stream(f.listFiles())
>>                   .forEach(fi -> fileExploder.accept(ds, fi)); /* recursive call ! */
>>         }
>>     };
>>     Arrays.stream(new File[] {baseFile})
>>           .explode(fileExploder)
>>           .forEach(fileConsumer::accept);
>> }
>>
>> As my next step I looked at whether I could parallelize the last step. But
>> simply inserting .parallel() does not seem to work - presumably because the
>> initial one-element Spliterator is governing split behaviour.
>>
>> Inserting .collect(Collectors.<File>toList()).parallelStream() instead works fine:
>>
>> Arrays.stream(new File[] {baseFile})
>>       .explode(fileExploder)
>>       .collect(Collectors.<File>toList())
>>       .parallelStream()
>>       .forEach(fileConsumer::accept);
>>
>> Main observations:
>>
>> (*) Exploding a single element requires a lot of programming overhead:
>> 1st step - creation of a one-element array or collection
>> 2nd step - stream the result of the 1st step
>> 3rd step - explode the result of the 2nd step
>> Not sure if anything can be done about this (see the sketch after these
>> observations).
>>
>> (*) .parallel() after .explode() behaves unexpectedly. Maybe the downstream should be based
>> on a completely fresh unknown-size spliterator.
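>>
>> To illustrate the first point: a single-element stream factory, say a
>> hypothetical Stream.of(baseFile), would collapse the three steps into
>> one:
>>
>> Stream.of(baseFile)
>>       .explode(fileExploder)
>>       .forEach(fileConsumer::accept);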
>>
>> Regards
>> Arne Siegel
>