[PATCH] 8217561 : X86: Add floating-point Math.min/max intrinsics, approval request
Bhateja, Jatin
jatin.bhateja at intel.com
Mon Feb 18 02:17:50 UTC 2019
Hi Bernard,
Thanks for your comments; please find my responses inline in the mail below.
Regards,
Jatin
> -----Original Message-----
> From: B. Blaser [mailto:bsrbnd at gmail.com]
> Sent: Sunday, February 17, 2019 9:06 PM
> To: Bhateja, Jatin <jatin.bhateja at intel.com>
> Cc: Vladimir Kozlov <vladimir.kozlov at oracle.com>; Nils Eliasson
> <nils.eliasson at oracle.com>; Viswanathan, Sandhya
> <sandhya.viswanathan at intel.com>; hotspot-compiler-
> dev at openjdk.java.net; eric.caspole at oracle.com
> Subject: Re: [PATCH] 8217561 : X86: Add floating-point Math.min/max
> intrinsics, approval request
>
> Hi Jatin, Vladimir, All,
>
> On Sun, 17 Feb 2019 at 10:02, Bhateja, Jatin <jatin.bhateja at intel.com>
> wrote:
> >
> > Hi Vladimir,
> >
> > I have added a benchmark over the revision shared by Bernard
> >
> > Following is the link to updated patch.
> >
> > http://cr.openjdk.java.net/~sviswanathan/Jatin/8217561/webrev.04/
> >
> > Regards,
> > Jatin
> >
> > > -----Original Message-----
> > > From: Vladimir Kozlov [mailto:vladimir.kozlov at oracle.com]
> > > Sent: Saturday, February 16, 2019 7:19 AM
> > >
> > > Eric C. asked to add a JMH microbenchmark to show the performance of the
> > > new code. You may have missed his e-mail:
> > >
> > > https://mail.openjdk.java.net/pipermail/hotspot-compiler-dev/2019-February/032830.html
> > >
> > > We need to see the benefits from these changes and verify them before
> > > pushing them.
> > >
> > > Thanks,
> > > Vladimir
>
> Thanks for your benchmark, Jatin, which is a good start.
>
> Unfortunately, it's biased because the loops are far too predictable
I agree. That is why, to counter the predictability, special values (signed zeros, NaNs) were interleaved between the regular floating-point values, since the original implementation exits early in such scenarios.
> and the modulo operations are very expensive compared to min/max.
I agree that modulo is more expensive than min/max, but it is used only during the setup phase.
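As an illustration of the interleaving idea mentioned above, a data setup along these lines would mix NaNs and signed zeros among the random values so that a branch-based min/max cannot settle into a predictable early-exit path. This is a hypothetical sketch, not the code from the patch; the class and method names are made up:

```java
import java.util.Random;

public class SpecialValueMix {
    // Hypothetical sketch: every group of four slots gets NaN, -0.0 and +0.0
    // alongside a random value, so the special-value paths of Math.min/max
    // are exercised continuously rather than on a rare, well-predicted branch.
    static double[] mixedDoubles(int count, long seed) {
        Random r = new Random(seed);
        double[] a = new double[count];
        for (int i = 0; i < count; i++) {
            switch (i % 4) {
                case 0:  a[i] = Double.NaN; break;
                case 1:  a[i] = -0.0;       break;
                case 2:  a[i] = +0.0;       break;
                default: a[i] = r.nextDouble();
            }
        }
        return a;
    }

    public static void main(String[] args) {
        double[] a = mixedDoubles(8, 42L);
        System.out.println(Double.isNaN(a[0]));                   // NaN slot
        System.out.println(Double.doubleToRawLongBits(a[1]) < 0); // sign bit of -0.0
    }
}
```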
> Therefore, the results don't seem to be really representative.
>
> The legal header also looks unusual, and please use 4-space indentation
> for Java files.
>
> I wrote another benchmark, here under and also attached, to be added to
> webrev.04.
>
> Here are the results on x86_64 (xeon), before:
>
> $ make test TEST="micro:vm.compiler.FpMinMaxIntrinsics" \
>     MICRO="VM_OPTIONS='-XX:CompileCommand=print,org/openjdk/bench/vm/compiler/FpMinMaxIntrinsics.*Bench* -XX:-InlineMathNatives';FORK=1;WARMUP_ITER=1;ITER=3"
>
> 089 movsd XMM0, [R10 + #16 + RAX << #3] # double
> [...]
> 099 movsd XMM1, [R10 + #16 + R8 << #3] # double
> [...]
> 0c0 ucomisd XMM0, XMM1 test
>
>
> Benchmark Mode Cnt Score Error Units
> FpMinMaxIntrinsics.dMax avgt 3 8786.771 ± 1853.250 ns/op
> FpMinMaxIntrinsics.dMin avgt 3 9066.529 ± 1257.368 ns/op
> FpMinMaxIntrinsics.fMax avgt 3 8655.304 ± 2056.253 ns/op
> FpMinMaxIntrinsics.fMin avgt 3 8639.211 ± 1400.143 ns/op
>
>
> and after:
>
> $ make test TEST="micro:vm.compiler.FpMinMaxIntrinsics" \
>     MICRO="VM_OPTIONS='-XX:CompileCommand=print,org/openjdk/bench/vm/compiler/FpMinMaxIntrinsics.*Bench* -XX:+InlineMathNatives';FORK=1;WARMUP_ITER=1;ITER=3"
>
> 085 movsd XMM4, [R11 + #16 + RAX << #3] # double
> [...]
> 095 movsd XMM0, [R11 + #16 + R8 << #3] # double
> [...]
> 09c blendvpd XMM2,XMM0,XMM4,XMM0
> 09c blendvpd XMM1,XMM4,XMM0,XMM0
> 09c vmaxsd XMM3,XMM1,XMM2
> 09c cmppd.unordered XMM2,XMM1,XMM1
> 09c blendvpd XMM0,XMM3,XMM1,XMM2
>
>
> Benchmark Mode Cnt Score Error Units
> FpMinMaxIntrinsics.dMax avgt 3 3971.653 ± 855.886 ns/op
> FpMinMaxIntrinsics.dMin avgt 3 3787.000 ± 524.271 ns/op
> FpMinMaxIntrinsics.fMax avgt 3 3786.412 ± 1013.097 ns/op
> FpMinMaxIntrinsics.fMin avgt 3 4051.351 ± 2327.196 ns/op
>
>
> We see that the new instruction sequence is about 2x faster than the
> previous one.
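For context on why the intrinsic emits the longer blendvpd/cmppd.unordered sequence rather than a bare compare: Java's Math.min/max must propagate NaN and order -0.0 below +0.0, which a plain ucomisd plus conditional move does not give. A small standalone check of those semantics (illustrative only, not part of the patch):

```java
public class MinMaxEdgeCases {
    public static void main(String[] args) {
        // +0.0 is treated as larger than -0.0 by Math.max/min ...
        System.out.println(Math.max(-0.0, 0.0));       // prints 0.0
        System.out.println(Math.min(-0.0, 0.0));       // prints -0.0
        // ... and NaN propagates through both operations.
        System.out.println(Math.max(1.0, Double.NaN)); // prints NaN
        System.out.println(Math.min(Double.NaN, 1.0)); // prints NaN
    }
}
```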
> Could we have a Reviewer feedback and eventually a definitive approval for
> this?
>
> Thanks,
> Bernard
>
>
> package org.openjdk.bench.vm.compiler;
>
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.infra.*;
>
> import java.util.concurrent.TimeUnit;
> import java.util.Random;
>
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @State(Scope.Thread)
> public class FpMinMaxIntrinsics {
>     private static final int COUNT = 1000;
>
>     private double[] doubles = new double[COUNT];
>     private float[] floats = new float[COUNT];
>
>     private int c1, c2, s1, s2;
>
>     private Random r = new Random();
>
>     @Setup
>     public void init() {
>         c1 = s1 = step();
>         c2 = COUNT - (s2 = step());
>
>         for (int i = 0; i < COUNT; i++) {
>             floats[i] = r.nextFloat();
>             doubles[i] = r.nextDouble();
>         }
>     }
>
>     private int step() {
>         return (r.nextInt() & 0xf) + 1;
>     }
>
>     @Benchmark
>     public void dMax(Blackhole bh) {
>         for (int i = 0; i < COUNT; i++)
>             bh.consume(dMaxBench());
>     }
>
>     @Benchmark
>     public void dMin(Blackhole bh) {
>         for (int i = 0; i < COUNT; i++)
>             bh.consume(dMinBench());
>     }
>
>     @Benchmark
>     public void fMax(Blackhole bh) {
>         for (int i = 0; i < COUNT; i++)
>             bh.consume(fMaxBench());
>     }
>
>     @Benchmark
>     public void fMin(Blackhole bh) {
>         for (int i = 0; i < COUNT; i++)
>             bh.consume(fMinBench());
>     }
>
Compilation logs show that the [fd]MaxBench* methods are getting inlined, which is expected.
This means that the call to inc() will also be benchmarked, which looks like an overhead.
>     private double dMaxBench() {
>         inc();
>         return Math.max(doubles[c1], doubles[c2]);
>     }
>
>     private double dMinBench() {
>         inc();
>         return Math.min(doubles[c1], doubles[c2]);
>     }
>
>     private float fMaxBench() {
>         inc();
>         return Math.max(floats[c1], floats[c2]);
>     }
>
>     private float fMinBench() {
>         inc();
>         return Math.min(floats[c1], floats[c2]);
>     }
>
>     private void inc() {
>         c1 = c1 + s1 < COUNT ? c1 + s1 : (s1 = step());
>         c2 = c2 - s2 > 0 ? c2 - s2 : COUNT - (s2 = step());
>     }
> }
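Regarding the inc() overhead noted above: one possible (untested, purely illustrative) variation would be to precompute the index sequences during setup, so the measured loop performs only the array loads and the min/max itself. A plain-Java sketch of the idea, with made-up class and method names, mirroring the forward walk of inc()/step():

```java
import java.util.Random;

public class PrecomputedIndices {
    static final int COUNT = 1000;

    // Same wrap-around index walk as inc()/step() in the benchmark, but run
    // once up front so the index arithmetic stays out of the timed region.
    static int[] precompute(int count, long seed) {
        Random r = new Random(seed);
        int[] idx = new int[count];
        int s = (r.nextInt() & 0xf) + 1;  // step in [1, 16], as in step()
        int c = s;
        for (int i = 0; i < count; i++) {
            idx[i] = c;
            c = c + s < count ? c + s : (s = (r.nextInt() & 0xf) + 1);
        }
        return idx;
    }

    public static void main(String[] args) {
        int[] idx1 = precompute(COUNT, 1L);
        int[] idx2 = precompute(COUNT, 2L);
        double[] doubles = new double[COUNT];
        Random r = new Random(3L);
        for (int i = 0; i < COUNT; i++)
            doubles[i] = r.nextDouble();

        // The measured loop would then reduce to loads plus Math.max.
        double acc = 0.0;
        for (int i = 0; i < COUNT; i++)
            acc += Math.max(doubles[idx1[i]], doubles[idx2[i]]);
        System.out.println(acc >= 0.0);
    }
}
```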