Optimizing arithmetic operations on processors with AVX2 support

Thu Dec 8 10:08:12 PST 2011

Such an assumption (in-order execution of statements) would be invalid even with the current memory model. There's nothing to stop the compilers from re-ordering the adds and multiplies so that they fill each other's pipeline delays.

So I don't think AVX2 brings anything new to the table in terms of perturbing the memory model.

----- Original Message -----
From: "John Platts" <john_platts at hotmail.com>
To: jdk8-dev at openjdk.java.net
Sent: Thursday, December 8, 2011 9:25:41 AM
Subject: Optimizing arithmetic operations on processors with AVX2 support

Here is an example of a class with an operation that can be optimized on a processor with AVX2 support:

class ExampleClass {
    public void ExampleOperation(ExampleClass y) {
        a += y.a;
        b *= y.b;
        c += y.c;
        d += y.d;
        e += y.e;
        f *= y.f;
        g *= y.g;
        h *= y.h;
    }

    private int a;
    private int b;
    private int c;
    private int d;
    private int e;
    private int f;
    private int g;
    private int h;
}

The AVX2 instruction set includes gather instructions that can be used to read from primitive fields that are not contiguous to each other. The AVX2 instruction set will be implemented on the Intel Haswell microarchitecture processors.
In the example above, a JVM running on a processor with the AVX2 instruction set can optimize the ExampleOperation method as follows:
- Reading the a, c, d, and e fields of both this and y using the VPGATHERDD instruction.
- Performing the 4 addition operations simultaneously using the PADDD instruction.
- Storing the result of the addition operations in a, c, d, and e using the PEXTRD instruction.
- Reading the b, f, g, and h fields of both this and y using the VPGATHERDD instruction.
- Performing the 4 multiplication operations simultaneously using the PMULLD instruction.
- Storing the result of the multiplication operations in b, f, g, and h using the PEXTRD instruction.
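
The regrouping described in the steps above can be sketched in plain Java. This is only an illustration of the reordering, not what HotSpot emits: the class RegroupSketch, the int[8] layout standing in for the fields a through h, and the ADD_LANES and MUL_LANES tables are all hypothetical names introduced here, and the scalar loops merely stand in for the gather, lane-wise arithmetic, and store steps. Under those assumptions, both orderings give the same single-threaded result:

```java
import java.util.Arrays;

// Hypothetical illustration only; none of these names come from HotSpot.
final class RegroupSketch {
    // Fields a..h are modelled as indices 0..7 of an int[8] (an assumption
    // made to keep the sketch short).
    static final int[] ADD_LANES = {0, 2, 3, 4}; // a, c, d, e
    static final int[] MUL_LANES = {1, 5, 6, 7}; // b, f, g, h

    // The operations in the order they appear in the source code.
    static int[] sourceOrder(int[] x, int[] y) {
        int[] r = x.clone();
        r[0] += y[0];
        r[1] *= y[1];
        r[2] += y[2];
        r[3] += y[3];
        r[4] += y[4];
        r[5] *= y[5];
        r[6] *= y[6];
        r[7] *= y[7];
        return r;
    }

    // The same operations regrouped: all four adds, then all four multiplies.
    static int[] regrouped(int[] x, int[] y) {
        int[] r = x.clone();
        int[] left = new int[4];
        int[] right = new int[4];

        // Stand-in for gathering the non-contiguous add operands.
        for (int i = 0; i < 4; i++) {
            left[i] = x[ADD_LANES[i]];
            right[i] = y[ADD_LANES[i]];
        }
        // Stand-in for one four-lane integer add.
        for (int i = 0; i < 4; i++) {
            left[i] += right[i];
        }
        // Stand-in for writing each lane back to its field.
        for (int i = 0; i < 4; i++) {
            r[ADD_LANES[i]] = left[i];
        }

        // Same three phases for the multiplies.
        for (int i = 0; i < 4; i++) {
            left[i] = x[MUL_LANES[i]];
            right[i] = y[MUL_LANES[i]];
        }
        for (int i = 0; i < 4; i++) {
            left[i] *= right[i];
        }
        for (int i = 0; i < 4; i++) {
            r[MUL_LANES[i]] = left[i];
        }
        return r;
    }

    public static void main(String[] args) {
        int[] x = {1, 2, 3, 4, 5, 6, 7, 8};
        int[] y = {10, 20, 30, 40, 50, 60, 70, 80};
        System.out.println(Arrays.toString(sourceOrder(x, y)));
        System.out.println(Arrays.toString(regrouped(x, y)));
    }
}
```

The two methods agree because the eight updates touch eight different fields and have no data dependencies on one another; only an observer in another thread, without synchronization, could tell the orderings apart.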
This optimization is perfectly legal under the Java Memory Model, since there are no volatile reads or volatile writes. However, this optimization would be illegal if a, b, c, d, e, f, g, or h were declared as volatile fields. This optimization must also respect constraints imposed by synchronized blocks, volatile reads, volatile writes, method calls, data dependencies, and strictfp semantics. This optimization would also need to be disabled if the method is being debugged by a Java debugger, as the Java debugger can step through each operation individually.
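
As a minimal sketch of how a synchronized method constrains what other threads can observe, consider the following. The class SyncedPair, its two fields, and the demo method are hypothetical and are trimmed down from the eight-field example to keep it short. Because both the writer and the reader go through the same monitor, the reader can never see one add applied without the other, whatever order the JIT chooses inside the method:

```java
// Hypothetical illustration only; trimmed to two fields for brevity.
final class SyncedPair {
    private int a;
    private int c;

    // Both adds run while holding this object's monitor, so their relative
    // order inside the method is invisible to other synchronized callers.
    synchronized void addToBoth(int delta) {
        a += delta;
        c += delta;
    }

    // Reads both fields under the same monitor.
    synchronized boolean consistent() {
        return a == c;
    }

    synchronized int a() {
        return a;
    }

    // Runs a writer thread while the calling thread checks the invariant.
    static boolean demo(int iterations) {
        SyncedPair p = new SyncedPair();
        Thread writer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                p.addToBoth(1);
            }
        });
        writer.start();

        boolean ok = true;
        for (int i = 0; i < iterations; i++) {
            ok &= p.consistent();
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return ok && p.a() == iterations;
    }

    public static void main(String[] args) {
        System.out.println("invariant held: " + demo(100_000));
    }
}
```

If the synchronized keyword were removed from these methods, the reader would be allowed to observe a and c out of step, which is the situation the paragraph above warns about.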
The point I am trying to illustrate is that Java programmers should not assume that the arithmetic operations performed by the ExampleOperation method are guaranteed to execute in the sequence shown in the source code. This example also illustrates the importance of proper synchronization. Will this optimization get implemented in the HotSpot VM in the future?