Streams, parallelization, and OOME.

Tue Oct 22 05:10:36 UTC 2024

Thanks Viktor, this was what I was looking for.

Ok, so due to existing difficulties in translating push to pull, it makes
sense why this wouldn't work.

I really look forward to upgrading my codebase soon. Gatherers fixed window
completely negates this problem. At least, it appears to.

On Mon, Oct 21, 2024, 6:54 AM Viktor Klang <viktor.klang at oracle.com> wrote:

> Hi David,
>
> Stream::spliterator() and Stream::iterator() suffer from inherent
> limitations (see https://bugs.openjdk.org/browse/JDK-8268483 ) because
> they attempt to convert push-style streams into pull-style constructs
> (Iterator, Spliterator). Since the only way to know if there's something to
> pull is for something to get pushed, but who's doing to pushing (answer:
> the same one who tries to do the pulling)?
>
> For sequential Streams this is often as problematic, but for parallel
> streams it's not the caller which evaluates the stream but rather a
> task-tree submitted to a ForkJoinPool. You can see the implementation here:
> https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/stream/StreamSpliterators.java#L272
>
> As an example, Stream::flatMap(…) had corner cases where its use of
> Stream::spliterator() would end up not terminating, which was fixed in Java
> 23: https://github.com/openjdk/jdk/pull/18625
> <https://github.com/openjdk/jdk/pull/18625>
> 8196106: Support nested infinite or recursive flat mapped streams by
> viktorklang-ora · Pull Request #18625 · openjdk/jdk
> <https://github.com/openjdk/jdk/pull/18625>
> This PR implements Gatherer-inspired encoding of flatMap that shows that
> it is both competitive performance-wise as well as improve correctness.
> Below is the performance of Stream::flatMap (for ref...
> github.com
>
>
>
>
> Cheers,
> √
>
>
> *Viktor Klang*
> Software Architect, Java Platform Group
> Oracle
> ------------------------------
> *From:* core-libs-dev <core-libs-dev-retn at openjdk.org> on behalf of David
> Alayachew <davidalayachew at gmail.com>
> *Sent:* Saturday, 19 October 2024 07:54
> *To:* core-libs-dev <core-libs-dev at openjdk.org>
> *Subject:* Streams, parallelization, and OOME.
>
> Hello Core Libs Dev Team,
>
> I have a file that I am streaming from a service, and I am trying to split
> into multiple parts based on a certain attribute found on each line. I am
> sending each part up to a different service.
>
> I am using BufferedReader.lines(). However, I cannot read the whole file
> into memory because it is larger than the amount of RAM that I have on the
> machine. So, since I don't have access to Java 22's Preview Gatherers Fixed
> Window, I used the iterator() method on my stream, wrapped that in another
> iterator that can grab my batch size worth of data, then built a
> spliterator from that that I then used to create a new stream. In short,
> this wrapper iterator isn't Iterator<T>, it's Iterator<List<T>>.
>
> When I ran this sequentially, everything worked well. However, my CPU was
> low and we definitely have a performance problem -- our team needs this
> number as fast as we can get. Plus, we had plenty of network bandwidth to
> spare, so I had (imo) good reason to go use parallelism.
>
> As soon as I turned on parallelism, the stream's behaviour changed
> completely. Instead of fetching the batch and processing, it started
> grabbing SEVERAL BATCHES and processing NONE OF THEM. Or at the very least,
> it grabbed so many batches that it ran out of memory before it could get to
> processing them.
>
> To give some numbers, this is a 4 core machine. And we can safely hold
> about 30-40 batches worth of data in memory before crashing. But again,
> when running sequentially, this thing only grabs 1 batch, processes that
> one batch, sends out the results, and then start the next one, all as
> expected. I thought that adding parallelism would simply make it so that we
> have this happening 4 or 8 times at once.
>
> After a very long period of digging, I managed to find this link.
>
>
> https://stackoverflow.com/questions/30825708/java-8-using-parallel-in-a-stream-causes-oom-error
>
> Tagir Valeev gives an answer which doesn't go very deep into the "why" at
> all. And the answer is more directed to the user's specific question as
> opposed to solving this particular problem.
>
> After digging through a bunch of other solutions (plus my own testing), it
> seems that the answer is that the engine that does parallelization for
> Streams tries to grab a large enough "buffer" before doing any parallel
> processing. I could be wrong, and how large that buffer is? I have no idea.
>
> Regardless, that's about where I gave up and went sequential, since the
> clock was ticking.
>
> But I still have a performance problem. How would one suggest going about
> this in Java 8?
>
> Thank you for your time and help.
> David Alayachew
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mail.openjdk.org/pipermail/core-libs-dev/attachments/20241022/1608d28b/attachment.htm>