Performance of FloatVector::pow vs. equivalent FloatVector::mul (oracle-jdk-19.0.2, Intel i7 8700B)

Tue Feb 28 20:43:34 UTC 2023

And replying to the list :-)

—

Yes, that's certainly possible, mirroring what C2 does for scalar Math.pow calls with the constant exponents 2.0 and 0.5:

 https://github.com/openjdk/jdk/blob/master/src/hotspot/share/opto/library_call.cpp#L1753

(It's a little more challenging to determine that a vector's lane elements are all the same constant, but it's possible.)
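For illustration, here is a minimal scalar sketch of that kind of rewrite in plain Java. It is not the C2 code, which performs the substitution on its intermediate representation when the exponent is a constant. `ConstantPowDemo.reducedPow` is a hypothetical helper that takes a multiply path for an exponent of 2.0 and a square-root path for 0.5. `Math.pow(-0.0, 0.5)` is `+0.0` and `Math.pow(Double.NEGATIVE_INFINITY, 0.5)` is `+Infinity`, whereas `Math.sqrt` returns `-0.0` and `NaN` for those inputs, so the sketch only substitutes `sqrt` for strictly positive `x`.

```java
class ConstantPowDemo {

    // Hypothetical helper: special-cases the constant exponents 2.0 and 0.5
    // before falling back to the general Math.pow routine.
    static double reducedPow(double x, double y) {
        if (y == 2.0) {
            // x^2 as a single multiply; NaN, signed zeros and infinities
            // give the same results as Math.pow(x, 2.0).
            return x * x;
        }
        if (y == 0.5 && x > 0.0) {
            // sqrt is only substituted for strictly positive x (including
            // +Infinity): for -0.0 and -Infinity, Math.pow and Math.sqrt disagree.
            return Math.sqrt(x);
        }
        // Every other exponent, and the remaining 0.5 inputs, use the general routine.
        return Math.pow(x, y);
    }

    public static void main(String[] args) {
        System.out.println(reducedPow(3.0, 2.0));   // 9.0
        System.out.println(reducedPow(16.0, 0.5));  // 4.0
        System.out.println(reducedPow(-0.0, 0.5));  // 0.0, same as Math.pow
        System.out.println(reducedPow(2.0, 10.0));  // 1024.0
    }
}
```

A vector version would additionally have to establish that every lane of the exponent holds the same constant before picking one of these paths.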

Paul.

> On Feb 28, 2023, at 12:22 PM, Dirk Toewe <dirktoewe at gmail.com> wrote:
> 
> Sorry for forgetting to include the mailing list.
> 
> ---------- Forwarded message ---------
> From: Dirk Toewe <dirktoewe at gmail.com>
> Date: Tue, 28 Feb 2023 at 21:20
> Subject: Re: Performance of FloatVector::pow vs. equivalent FloatVector::mul (oracle-jdk-19.0.2, Intel i7 8700B)
> To: Paul Sandoz <paul.sandoz at oracle.com>
> 
> 
> Hi,
> 
> if the exponent is a compile-time constant integer, it would be really useful if either the compiler or the JIT optimized the pow call away. Wouldn't this still beat Intel's Vector Math Library (unless the JIT does some template/macro magic)?
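Dirk's suggestion amounts to replacing `pow` with a fixed sequence of multiplications when the exponent is a known non-negative integer. Below is a minimal sketch of that idea, assuming exponentiation by squaring as the technique. A plain `float[]` stands in for the lanes of a `FloatVector`, and `powInt` and `powIntLanes` are hypothetical names, not Vector API methods. An exponent n costs O(log n) multiplications. For n greater than 2 the intermediate products are rounded separately, so results may differ in the last bit from `(float) Math.pow(x, n)`.

```java
import java.util.Arrays;

class IntegerPowDemo {

    // Hypothetical helper: x^n for a non-negative integer n via
    // exponentiation by squaring.
    static float powInt(float x, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative exponent: " + n);
        }
        float result = 1.0f;
        float base = x;
        int e = n;
        while (e > 0) {
            if ((e & 1) != 0) {
                result *= base;   // fold in the current power of x
            }
            base *= base;         // x, x^2, x^4, x^8, ...
            e >>>= 1;
        }
        return result;
    }

    // Applies powInt to every element, standing in for a lane-wise vector op.
    static float[] powIntLanes(float[] lanes, int n) {
        float[] out = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = powInt(lanes[i], n);
        }
        return out;
    }

    public static void main(String[] args) {
        float[] lanes = {1.0f, 2.0f, -3.0f, 0.5f};
        System.out.println(Arrays.toString(powIntLanes(lanes, 2))); // [1.0, 4.0, 9.0, 0.25]
        System.out.println(Arrays.toString(powIntLanes(lanes, 5))); // [1.0, 32.0, -243.0, 0.03125]
    }
}
```

For n = 2 the loop reduces to `1.0f * (x * x)`, which is exactly `x * x`, the same computation as `fv.mul(fv)` performs per lane.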
> 
> Regards
> Dirk
> 
> On Tue, 28 Feb 2023 at 21:08, Paul Sandoz <paul.sandoz at oracle.com> wrote:
> Hi Alex,
> 
> The performance difference you observe is because the pow operation is falling back to scalar code (Math.pow on each lane element) and not using vector instructions.
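To picture the difference, the sketch below mimics both code shapes on a plain `float[]` of eight elements, the lane count of a 256-bit float vector. `scalarFallbackPow` assumes the fallback widens each lane to `double`, calls `Math.pow`, and narrows the result back to `float`. That detail and both method names are assumptions for illustration, not the JDK's actual implementation. It is not a benchmark and does not reproduce the ~40x gap. It only shows that the fallback pays for a full `Math.pow` call per lane, while `mul` needs one multiply per lane, which a vector unit performs for all lanes in a single instruction.

```java
import java.util.Arrays;

class ScalarFallbackDemo {

    // Hypothetical stand-in for a lane-wise pow with no vector intrinsic:
    // each lane goes through Math.pow on its own.
    static float[] scalarFallbackPow(float[] lanes, float exponent) {
        float[] out = new float[lanes.length];
        for (int i = 0; i < lanes.length; i++) {
            out[i] = (float) Math.pow(lanes[i], exponent);
        }
        return out;
    }

    // Hypothetical stand-in for fv.mul(fv): one multiply per lane.
    static float[] laneWiseMul(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("lane counts differ");
        }
        float[] out = new float[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] * b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] lanes = {1.0f, -2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};
        // Both lines print [1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0]
        System.out.println(Arrays.toString(scalarFallbackPow(lanes, 2.0f)));
        System.out.println(Arrays.toString(laneWiseMul(lanes, lanes)));
    }
}
```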
> 
> On x86 Linux or Windows you should observe better performance of the pow operation, because it should leverage code from Intel’s Short Vector Math Library [1]; however, that code is OS-specific and is not currently ported to Mac OS.
> 
> Paul.
> 
> [1] 
> https://github.com/openjdk/jdk/tree/master/src/jdk.incubator.vector/linux/native/libjsvml
> https://github.com/openjdk/jdk/tree/master/src/jdk.incubator.vector/windows/native/libjsvml
> 
> > On Feb 26, 2023, at 8:01 PM, Alex K <aklibisz at gmail.com> wrote:
> > 
> > Hello,
> > 
> > I have a question, or possibly a bug to report, regarding performance with the jdk.incubator.vector.FloatVector class.
> > 
> > Specifically, given a FloatVector fv, I've found that calling fv.mul(fv) is ~40x faster than calling fv.pow(2).
> > 
> > Here is an example JMH benchmark: https://github.com/alexklibisz/site-projects/blob/782dcd53d3ee09c93f65b660c8ed4fd030a8889a/jdk-incubator-vector-optimizations/src/main/scala/com/alexklibisz/BenchPowVsMul.scala
> > 
> > The results look like this, on my 2018 Mac Mini, w/ Intel i7 8700B, running oracle-jdk-19.0.2:
> > 
> > [Benchmark results attached as image.png; not reproduced here]
> > 
> > As far as I can tell, the two operations produce equivalent results, yet one is significantly faster than the other.
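One way to check the equivalence claim at the scalar level, without the Vector API, is to compare `x * x` with `(float) Math.pow(x, 2.0)` bit for bit. The sketch below (class and method names are hypothetical) uses `Float.floatToIntBits`, which maps every NaN to one canonical pattern while still distinguishing `0.0f` from `-0.0f`. The sample inputs are integers and special values, for which the `Math.pow` specification pins down the exact result. For other inputs the specification only promises a result within 1 ulp, so a clean run is evidence rather than proof, and it tests scalar semantics, not the `FloatVector` implementation itself.

```java
class MulVsPowCheck {

    // Counts inputs where x * x and (float) Math.pow(x, 2.0) differ bit for bit.
    // floatToIntBits canonicalizes NaN, so NaN results compare as equal,
    // while 0.0f and -0.0f are still told apart.
    static int countMismatches(float[] inputs) {
        int mismatches = 0;
        for (float x : inputs) {
            float viaMul = x * x;
            float viaPow = (float) Math.pow(x, 2.0);
            if (Float.floatToIntBits(viaMul) != Float.floatToIntBits(viaPow)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        float[] inputs = {
            0.0f, -0.0f, 1.0f, -3.0f, 1024.0f,
            Float.MAX_VALUE, Float.POSITIVE_INFINITY, Float.NEGATIVE_INFINITY, Float.NaN
        };
        // Prints "mismatches: 0" for these inputs.
        System.out.println("mismatches: " + countMismatches(inputs));
    }
}
```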
> > 
> > I'm eager to learn if this is expected, a regression, or something else.
> > 
> > Thanks,
> > Alex Klibisz
>