Numerical Stream code

Brian Goetz brian.goetz at oracle.com
Thu Feb 14 06:37:55 PST 2013


Some things here:

Sequential stream vs. C-style: You've manually inlined t0() and tg0() in
your C-style version, whereas the stream version dispatches through a
lambda for every index.  Not quite a fair comparison now, is it?  (Though
even in a fairer comparison, the VM will probably still, at this time,
have an easier time inlining the C-style version.)
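
A fairer baseline might look like the sketch below: the C-style loop
calls the same per-index body through a functional interface instead of
having it inlined by hand. This is only an illustration. The class name,
constants and array size are hypothetical, and IntStream.range from
released Java 8 stands in for the nightly build's Arrays.indices.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class FairComparison {
    // Hypothetical stand-ins for the benchmark's state; sizes and values are assumptions.
    static final int NUM_XS = 1_000;
    static final double U_T0 = 100.0, U_T1 = 100.0, U0X = 0.0;
    static double[] uM1 = new double[NUM_XS];

    // Same per-index body as t0() in the quoted code.
    static void t0(int xi) {
        if (xi == 0) { uM1[0] = U_T0; }
        else if (xi == NUM_XS - 1) { uM1[NUM_XS - 1] = U_T1; }
        else { uM1[xi] = U0X; }
    }

    // C-style loop that still dispatches through a functional interface per index.
    static void loopDispatch(IntConsumer body) {
        for (int xi = 0; xi < NUM_XS; xi++) { body.accept(xi); }
    }

    // Stream version using the same per-index body.
    static void streamDispatch(IntConsumer body) {
        IntStream.range(0, NUM_XS).forEach(body);
    }

    public static void main(String[] args) {
        loopDispatch(FairComparison::t0);
        double[] viaLoop = uM1.clone();
        Arrays.fill(uM1, -1.0);               // wipe so the stream run is checked independently
        streamDispatch(FairComparison::t0);
        System.out.println(Arrays.equals(viaLoop, uM1)); // both variants fill the array identically
    }
}
```

Timing loopDispatch against streamDispatch (after warm-up) compares like
with like; timing the hand-inlined loop against either one also measures
the cost of the per-index call.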

average() is doing a lot more work (~4x) than just averaging.  See the 
source code.
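
The source of that build is not reproduced here, so the following is
only an illustration of how an averaging reduction can carry extra
bookkeeping. It assumes released Java 8, where DoubleSummaryStatistics
maintains count, sum, min and max for every element; whether the nightly
build's average() worked this way is an assumption, not something the
thread confirms. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

public class AverageCost {
    // Plain mean: one running sum and one division.
    static double plainAverage(double[] xs) {
        double sum = 0;
        for (double x : xs) { sum += x; }
        return sum / xs.length;
    }

    // Mean taken from a full summary: count, sum, min and max are all updated per element.
    static double summaryAverage(double[] xs) {
        DoubleSummaryStatistics stats = Arrays.stream(xs).summaryStatistics();
        return stats.getAverage();
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 2.0, 3.0, 4.0};
        System.out.println(plainAverage(xs));
        System.out.println(summaryAverage(xs));
    }
}
```

Both methods return the same mean; the difference is only in how much
state is updated per element along the way.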

The parallel version is almost certainly suffering from false sharing of
cache lines when adjacent tasks write to the shared arrays (u0, etc.).
This has nothing to do with streams; it is a standard parallelism gotcha.
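
One common way to reduce this kind of contention is to hand each task a
contiguous block of indices, so that two tasks can touch the same cache
line only at a block boundary. The sketch below assumes 64-byte cache
lines (8 doubles per line) and an arbitrary block size; kernel() is a
hypothetical stand-in for explicitFDM.u00, which the thread does not
show. Whether this actually helps the benchmark would have to be
measured.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class BlockedUpdate {
    // Assumption: 64-byte cache lines, i.e. 8 doubles per line; the block is a multiple of that.
    static final int BLOCK = 8 * 512;

    // Hypothetical stand-in for explicitFDM.u00: a simple three-point average.
    static double kernel(double left, double mid, double right) {
        return (left + mid + right) / 3.0;
    }

    // One time step over the interior points, serially.
    static void stepSerial(double[] prev, double[] next) {
        for (int xi = 1; xi < prev.length - 1; xi++) {
            next[xi] = kernel(prev[xi - 1], prev[xi], prev[xi + 1]);
        }
    }

    // Same step, parallelised over contiguous blocks instead of single indices.
    static void stepBlocked(double[] prev, double[] next) {
        int n = prev.length;
        int blocks = (n + BLOCK - 1) / BLOCK;          // ceiling division
        IntStream.range(0, blocks).parallel().forEach(b -> {
            int lo = Math.max(1, b * BLOCK);           // skip the x = 0 boundary point
            int hi = Math.min(n - 1, (b + 1) * BLOCK); // skip the x = 1 boundary point
            for (int xi = lo; xi < hi; xi++) {
                next[xi] = kernel(prev[xi - 1], prev[xi], prev[xi + 1]);
            }
        });
    }

    public static void main(String[] args) {
        double[] prev = new double[20_000];
        for (int i = 0; i < prev.length; i++) { prev[i] = Math.sin(i); }
        double[] a = new double[prev.length];
        double[] b = new double[prev.length];
        stepSerial(prev, a);
        stepBlocked(prev, b);
        System.out.println(Arrays.equals(a, b)); // the two steps compute identical values
    }
}
```

Each task only reads prev and writes its own slice of next, so no
synchronisation is needed within a time step; the boundary values at
index 0 and n - 1 are left for the caller to set, as in the quoted code.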



On 2/14/2013 1:34 AM, Howard Lovatt wrote:
> Hi,
>
> I have been trying out lambdas on:
>
> openjdk version "1.8.0-ea"
> OpenJDK Runtime Environment (build
> 1.8.0-ea-lambda-nightly-h3307-20130211-b77-b00)
> OpenJDK 64-Bit Server VM (build 25.0-b15, mixed mode)
>
> to see whether scientific-style numerical code can use Streams. I wrote a
> synthetic benchmark that applies a kernel repeatedly over time and space to
> solve a diffusion equation in 1-D, e.g. heat diffusing into a metal rod
> from either end. The core of the code is:
>
>    private enum Styles implements Style {
>      CLike {
>        @Override public double run() {
>          uM1[0] = uT0; // t = 0
>          for (int xi = 1; xi < numXs - 1; xi++) { uM1[xi] = u0X; }
>          uM1[numXs - 1] = uT1;
>          for (int ti = 1; ti < numTs; ti++, uTemp = uM1, uM1 = u0, u0 = uTemp) { // t > 0
>            u0[0] = uT0; // x = 0
>            for (int xi = 1; xi < numXs - 1; xi++) { // 0 < x < 1
>              u0[xi] = explicitFDM.u00(uM1[xi - 1], uM1[xi], uM1[xi + 1]);
>            }
>            u0[numXs - 1] = uT1; // x = 1
>          }
>          double sum = 0; // Calculate average of last us
>          for (final double u : uM1) { sum += u; }
>          return sum / numXs;
>        }
>      },
>
>      SerialStream {
>        @Override public double run() {
>          Arrays.indices(uM1).forEach(this::t0);
>          for (int ti = 1; ti < numTs; ti++, uTemp = uM1, uM1 = u0, u0 = uTemp) { // t > 0
>            Arrays.indices(uM1).forEach(this::tg0);
>          }
>          return Arrays.stream(uM1).average().getAsDouble(); // Really slow!
>        }
>      },
>
>      ParallelStream {
>        @Override public double run() {
>          Arrays.indices(uM1).parallel().forEach(this::t0);
>          for (int ti = 1; ti < numTs; ti++, uTemp = uM1, uM1 = u0, u0 = uTemp) { // t > 0
>            Arrays.indices(uM1).parallel().forEach(this::tg0);
>          }
>          return Arrays.stream(uM1).parallel().average().getAsDouble(); // Really really slow!!
>        }
>      };
>
>      double[] u0 = new double[numXs];
>      double[] uM1 = new double[numXs];
>      double[] uTemp = null;
>
>      void t0(final int xi) {
>        if (xi == 0) { uM1[0] = uT0; }
>        else if (xi == numXs - 1) { uM1[numXs - 1] = uT1; }
>        else { uM1[xi] = u0X; }
>      }
>
>      void tg0(final int xi) {
>        if (xi == 0) { u0[0] = uT0; }
>        else if (xi == numXs - 1) { u0[numXs - 1] = uT1; }
>        else { u0[xi] = explicitFDM.u00(uM1[xi - 1], uM1[xi], uM1[xi + 1]); }
>      }
>    }
>
> And when run it produces:
>
> CLike: time = 2351 ms, result = 99.99581170383331
> SerialStream: time = 20532 ms, result = 99.99581170383331
> ParallelStream: time = 131317 ms, result = 99.99581170383331
>
> The slowness is a pity because the coding comes out quite well!
>
> I wasn't particularly expecting the Stream implementation to be fast,
> because it is a work in progress after all. However, a factor of almost
> 10 for the serial case and over 50 for the parallel case seems excessive. I
> therefore suspect that I am doing something wrong.
>
> Can anyone enlighten me?
>
> Thanks,
>
>    -- Howard.
>


More information about the lambda-dev mailing list