Regression (b92): OOM on filter/limit operation on unbound/parallel stream
Mallwitz, Christian
christian.mallwitz at Commerzbank.com
Tue Jun 4 08:15:18 PDT 2013
First of all, adding -Xint seems to make the problem go away for me as well: using a parallel stream does not use much more heap than a serial stream _and_ there is a speed up.
When not using -Xint, the output from the isPrime predicate shows that increasingly larger values of n are tested, way beyond the 4,000,000 seen in the serial stream case (GC output removed):
10.077: 131349505
74.073: 137916980
141.865: 144812829
214.485: 152053470
292.511: 159656143
...
...
...
During that time heap usage slowly creeps up until the heap is completely filled and an OOM occurs for me.
Christian
-----Original Message-----
From: Paul Sandoz [mailto:paul.sandoz at oracle.com]
Sent: Tuesday, June 04, 2013 3:17 PM
To: Mallwitz, Christian
Cc: 'lambda-dev at openjdk.java.net'
Subject: Re: Regression (b92): OOM on filter/limit operation on unbound/parallel stream
Hi Christian,
I can reproduce the behaviour of your test program taking ages to run, but I needed to increase mx to 2g. Initially I notice lots of GCs, then it starts slowing down with the occasional "GC (Allocation Failure)". Something fishy is going on here: it is very slow with one core pegged at 100%, and VisualVM shows that the used heap is way less than the heap size.
If I run in interpreted mode with -Xint, it works fine, suggesting we are running into a hotspot issue, perhaps a hotspot/GC issue:
$ java -Xint -mx1g -XX:+PrintCommandLineFlags -verbose:gc -XX:+PrintGCTimeStamps OOM2
-XX:ClassMetaspaceSize=104857600 -XX:InitialHeapSize=134217728 -XX:MaxHeapSize=1073741824 -XX:+PrintCommandLineFlags -XX:+PrintGC -XX:+PrintGCTimeStamps -XX:+UseCompressedKlassPointers -XX:+UseCompressedOops -XX:+UseParallelGC
3.766: [GC (Allocation Failure) 32767K->656K(125632K), 0.0010730 secs]
8.476: [GC (Allocation Failure) 33424K->592K(125632K), 0.0008800 secs]
13.520: 3999970
firstNPrimes ( 283146, false): 283146
13.623: [GC (Allocation Failure) 33360K->960K(125632K), 0.0010140 secs]
14.019: [GC (Allocation Failure) 33728K->6244K(125632K), 0.0035470 secs]
14.530: [GC (Allocation Failure) 39004K->11608K(124864K), 0.0050110 secs]
15.034: [GC (Allocation Failure) 43608K->23289K(123776K), 0.0071630 secs]
15.499: [GC (Allocation Failure) 48175K->27989K(129600K), 0.0033540 secs]
15.946: [GC (Allocation Failure) 53141K->38138K(126336K), 0.0083030 secs]
15.878: 4199968
16.043: 4475905
16.416: [GC (Allocation Failure) 57018K->24262K(111040K), 0.0027570 secs]
firstNPrimes ( 283146, true): 283146
From the GC output it looks like a speed up of about 3-4x.
I can even run with a lower mx value. Interestingly, if I lower mx to, say, 32m, I start running into similar issues: the GC output shows continual full GCs.
--
Because iterate produces an ordered and infinite source, hooking it up to limit will always require buffering when going parallel. We have managed to reduce some of the internal memory churn, but this really comes down to better documentation.
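As an aside (this is only a sketch on my part, and it changes the semantics: you get a count over some 283_146 primes, not necessarily the first 283_146 in encounter order), dropping the ordering constraint releases limit from having to buffer a prefix in order:

    // Sketch, assuming it runs inside the OOM class below so that
    // isPrime resolves, and that java.util.stream.LongStream is
    // imported. unordered() declares the encounter order irrelevant,
    // so limit(n) may take any n matching elements instead of
    // buffering the first n in order.
    long count = LongStream.iterate(1L, n -> n + 1L)
            .parallel()
            .unordered()
            .filter(OOM::isPrime)
            .limit(283_146)
            .count();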
To obtain 283_146 primes at least 3_999_971 elements will need to be buffered. The LongStream.iterate method extracts parallelism by sequentially copying a prefix of elements into an array and growing the array over time. Each array will be operated on by an F/J task. To get any parallel speed up, the cost of processing elements must be higher than the cost of copying those elements into an array; I think primality testing is a good example of where this should work well.
Note that a limit of N does not guarantee that only N elements will be processed upstream; it may be more, especially so given the increasing array sizes used by iterate.
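To illustrate (again only a sketch, leaning on the comment in your program that the first 283_146 primes are all below 4_000_000): a bounded, sized source such as LongStream.range splits cleanly up front and needs neither limit nor buffering:

    // Sketch, same assumptions as above: range(2, 4_000_000) is a
    // sized, splittable source, so the F/J framework can divide it
    // up front without copying prefixes into growing arrays.
    long count = LongStream.range(2, 4_000_000)
            .parallel()
            .filter(OOM::isPrime)
            .count(); // 283_146, per the comment in the program below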
Using boxed values is definitely going to kill performance since it will mess up the caches: while the Long[] array itself is contiguous, the boxed Long values it references could be anywhere in memory.
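For example (a sketch with the same assumptions as above), staying on the primitive LongStream and never calling boxed() keeps every element a plain long:

    // Sketch: without boxed() the filter takes a LongPredicate and
    // no Long object is allocated per element.
    long count = LongStream.iterate(1L, n -> n + 1L)
            .filter(OOM::isPrime)
            .limit(283_146)
            .count();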
Paul.
On Jun 4, 2013, at 2:02 PM, "Mallwitz, Christian" <christian.mallwitz at commerzbank.com> wrote:
> Hi,
>
> Re-reporting a problem for build 92: the filter/limit operation on an unbound, parallel stream behaves very weirdly. How weirdly? Well:
>
> - the serial version uses about 16M of heap; the parallel version gobbles up all of the 1GB defined as max
> - it throws an OOM after running for hours, while __NOT__ garbage collecting much
> - the filter predicate is called on millions more elements than are observed with a serial stream
> - I can't observe any parallelism (based on number of cores and system load: 4 cores and only 25% load)
>
> I'm using 1.8.0-ea-lambda-nightly-h4544-20130526-b92-b00 on a Windows XP 32 bit machine - VM args: -mx1g -XX:+PrintCommandLineFlags -verbose:gc -XX:+PrintGCTimeStamps
>
> My test program tries to find the first 283,146 prime numbers (which are all below 4,000,000). It executes the filter/limit operation on a serial stream first, and occasionally prints out the numbers being tested for primality.
>
> Thanks
> Christian
>
>
>
>
>
> package com.snuffbumble.lambda;
>
> import java.util.stream.LongStream;
> import java.util.stream.Stream;
>
> public class OOM {
>
> public static void main(String... ignored) {
>
> // first 283_146 primes are all < 4_000_000
>
> firstNPrimes(283_146, getIteratorBasedStream(false)); // only minor GC - max heap 16M
> firstNPrimes(283_146, getIteratorBasedStream(true)); // full GC - up to max heap 1GB and subsequent OOM
> }
>
> private static Stream<Long> getIteratorBasedStream(boolean produceParallelStream) {
> Stream<Long> s = LongStream.iterate(1L, n -> n + 1L).boxed();
> return produceParallelStream ? s.parallel() : s;
> }
>
> private static void firstNPrimes(int n, Stream<Long> stream) {
> System.out.println(String.format("firstNPrimes (%8d, %5b): %8d", n, stream.isParallel(),
> stream
> .filter(OOM::isPrime)
> .limit(n) // limit after primality test
> .count()));
> }
>
> private static long start_millis = System.currentTimeMillis();
> private static long max_n_seen = 3_999_970L;
>
> private static boolean isPrime(long n) {
>
> if (n >= max_n_seen) {
> System.out.println(String.format("%.3f: %d", (System.currentTimeMillis() - start_millis)/1000.0, n));
> max_n_seen = n + (n/20L); // new max_n_seen 5% larger than current n
> }
>
> if (n <= 1) { return false; }
> if (n == 2) { return true; }
> if (n % 2 == 0) { return false; }
> for (int i = 3; i <= (int) Math.sqrt(n); i += 2) { if (n % i == 0) return false; }
> return true;
> }
> }
>
>