<div dir="auto"><p dir="ltr">But *not letting any of them through into the forEach loop.</p></div>
<br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Oct 19, 2024, 9:36 AM David Alayachew <<a href="mailto:davidalayachew@gmail.com" target="_blank" rel="noreferrer">davidalayachew@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><p dir="ltr">So to be clear, I added a logger to my BufferedReader. So I know for a fact that it is reading data.</p>
<p dir="ltr">And as for the code, it is a very simple parallel forEach.</p>
<p dir="ltr">someStream.parallel().forEach(**work**);</p>
<p dir="ltr">I only wanted to change the execution from sequential to parallel.</p>
<p dir="ltr">So I have millions and millions of lines being read from the source (1 million lines is several batches alone) and none of it is entering the forEach loop. And I know this because the very first thing my forEach loop does is print logs.</p>
<p dir="ltr">As for the iterator, it is literally just a hand-rolled iterator (that I copied from StackOverflow) with an instance field holding a list of elements, sized to match my batch size. hasNext() prepares the next batch, stores it in that instance field, and returns true or false depending on whether there is data to send; next() then returns that instance field's prepared batch. Preparing is just making a new list and storing a batch size's worth of elements in it.</p>
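A minimal sketch of a batching iterator along those lines, together with the iterator-to-spliterator-to-stream construction described elsewhere in the thread (class and method names here are illustrative assumptions, not the actual code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Wraps an Iterator<T> and hands out fixed-size batches, i.e. Iterator<List<T>>.
public class BatchingIterator<T> implements Iterator<List<T>> {

    private final Iterator<T> source;
    private final int batchSize;
    private List<T> nextBatch; // prepared by hasNext(), handed out by next()

    public BatchingIterator(Iterator<T> source, int batchSize) {
        this.source = source;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        if (nextBatch == null) {
            // Prepare the next batch: a fresh list holding up to batchSize elements.
            List<T> batch = new ArrayList<>(batchSize);
            while (batch.size() < batchSize && source.hasNext()) {
                batch.add(source.next());
            }
            if (!batch.isEmpty()) {
                nextBatch = batch;
            }
        }
        return nextBatch != null;
    }

    @Override
    public List<T> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> batch = nextBatch;
        nextBatch = null;
        return batch;
    }

    // Rebuild a Stream<List<T>> from the batching iterator, as described in the thread.
    // Note the spliterator is UNSIZED, which matters for parallel execution.
    public static <T> Stream<List<T>> asStream(Iterator<T> source, int batchSize) {
        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(
                        new BatchingIterator<>(source, batchSize), 0),
                false); // false = sequential; calling .parallel() later flips it
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = new ArrayList<>();
        asStream(Arrays.asList(1, 2, 3, 4, 5).iterator(), 2).forEach(batches::add);
        System.out.println(batches); // [[1, 2], [3, 4], [5]]
    }
}
```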
<p dir="ltr">So this means that it is just grabbing batches and batches, but not letting any of them through into the forEach loop. And when I turn off parallelism, they suddenly go through just fine, one after the other.</p>
<br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Oct 19, 2024, 7:12 AM Olexandr Rotan <<a href="mailto:rotanolexandr842@gmail.com" rel="noreferrer noreferrer" target="_blank">rotanolexandr842@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Hi David. I am not on the core libs team, but I think I can offer some clues :).<br><br>It is hard to tell without the code, but I assume there are a few layers to it.<br><br>1. Stalling. I would guess it is caused mostly by GC pauses taking too long (effectively forever) if the GC has no compute left to run on. There is a fairly common GC-pause-related issue where a database connection is interrupted with a "Broken pipe" exception; under the hood, the connection times out during a long GC pause on a machine running low on memory. I am not saying this is your case, but if I were to guess, I would assume the stall is caused by low memory.<br><br>2. The out-of-memory root cause may be excessive splitting of your data source input. You could try to limit it by modifying the behaviour of the trySplit method of your spliterator.<br><br>Alternatively, if you don't mind taking up some disk space, you can try to stream the data into a file, save it, and then use memory-mapped buffers (java.nio.MappedByteBuffer) to process the accepted data. 
I am not sure this will work, but memory-mapped files are a common tool for dealing with operations that can't fit into RAM.<br><br>Regards<br><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Oct 19, 2024 at 8:54 AM David Alayachew <<a href="mailto:davidalayachew@gmail.com" rel="noreferrer noreferrer noreferrer" target="_blank">davidalayachew@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto"><div dir="auto">Hello Core Libs Dev Team,</div><div dir="auto"><br></div><div dir="auto">I have a file that I am streaming from a service, and I am trying to split it into multiple parts based on a certain attribute found on each line. I am sending each part up to a different service.</div><div dir="auto"><br></div><div dir="auto">I am using BufferedReader.lines(). However, I cannot read the whole file into memory because it is larger than the amount of RAM that I have on the machine. So, since I don't have access to Java 22's Preview Gatherers Fixed Window, I used the iterator() method on my stream, wrapped that in another iterator that grabs a batch size's worth of data, then built a spliterator from that, which I then used to create a new stream. In short, this wrapper iterator isn't Iterator<T>, it's Iterator<List<T>>.</div><div dir="auto"><br></div><div dir="auto">When I ran this sequentially, everything worked well. However, my CPU usage was low and we definitely have a performance problem -- our team needs this number as fast as we can get it. Plus, we had plenty of network bandwidth to spare, so I had (imo) good reason to use parallelism.</div><div dir="auto"><br></div><div dir="auto">As soon as I turned on parallelism, the stream's behaviour changed completely. Instead of fetching a batch and processing it, it started grabbing SEVERAL BATCHES and processing NONE OF THEM. 
Or at the very least, it grabbed so many batches that it ran out of memory before it could get to processing them.</div><div dir="auto"><br></div><div dir="auto">To give some numbers, this is a 4-core machine. And we can safely hold about 30-40 batches' worth of data in memory before crashing. But again, when running sequentially, this thing only grabs 1 batch, processes that one batch, sends out the results, and then starts the next one, all as expected. I thought that adding parallelism would simply make it so that we have this happening 4 or 8 times at once.</div><div dir="auto"><br></div><div dir="auto">After a very long period of digging, I managed to find this link.</div><div dir="auto"><br></div><div dir="auto"><a href="https://stackoverflow.com/questions/30825708/java-8-using-parallel-in-a-stream-causes-oom-error" rel="noreferrer noreferrer noreferrer" target="_blank">https://stackoverflow.com/questions/30825708/java-8-using-parallel-in-a-stream-causes-oom-error</a></div><div dir="auto"><br></div><div dir="auto">Tagir Valeev gives an answer that doesn't go very deep into the "why" at all, and it is directed more at the user's specific question than at solving this problem in general.</div><div dir="auto"><br></div><div dir="auto">After digging through a bunch of other solutions (plus my own testing), it seems that the answer is that the engine that does parallelization for Streams tries to grab a large enough "buffer" before doing any parallel processing. I could be wrong, and I have no idea how large that buffer is.</div><div dir="auto"><br></div><div dir="auto">Regardless, that's about where I gave up and went sequential, since the clock was ticking.</div><div dir="auto"><br></div><div dir="auto">But I still have a performance problem. How would one suggest going about this in Java 8?</div><div dir="auto"><br></div><div dir="auto">Thank you for your time and help.</div><div dir="auto">David Alayachew</div><div dir="auto"><br></div></div>
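One commonly suggested Java 8 approach to the question above is to avoid the parallel stream machinery entirely: consume the batches sequentially and hand each one to a fixed thread pool, with a Semaphore bounding how many batches can be in memory at once. A sketch under those assumptions (names, pool size, and in-flight cap are illustrative, not from the thread):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedBatchProcessor {

    // Drains the batch iterator with bounded parallelism; returns total lines processed.
    public static int processAll(Iterator<List<String>> batches,
                                 int workers, int maxInFlight) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        Semaphore inFlight = new Semaphore(maxInFlight);
        AtomicInteger processed = new AtomicInteger();

        while (batches.hasNext()) {
            List<String> batch = batches.next();
            inFlight.acquire(); // blocks the producer instead of buffering unboundedly
            pool.execute(() -> {
                try {
                    processed.addAndGet(batch.size()); // real per-batch work goes here
                } finally {
                    inFlight.release();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the Iterator<List<String>> of line batches from the reader.
        Iterator<List<String>> demo = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c", "d"),
                Arrays.asList("e")).iterator();
        // e.g. 4 workers (one per core), at most 8 batches held in memory at once
        System.out.println("lines processed: " + processAll(demo, 4, 8));
    }
}
```

Because acquire() happens on the reading thread, at most maxInFlight batches ever exist at once, which is the invariant the parallel stream did not provide here.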
</blockquote></div>
</blockquote></div>
</blockquote></div>