RFR: 8287925: AArch64: intrinsics for compareUnsigned method in Integer and Long
Hao Sun
haosun at openjdk.org
Mon Dec 5 06:15:54 UTC 2022
On Thu, 1 Dec 2022 11:43:50 GMT, Andrew Haley <aph at openjdk.org> wrote:
> Try this one:
>
> ```
> @Benchmark
> public int compareUnsignedDirect(Blackhole bh) {
> int probe1 = seed, probe2 = seed ^ seed << 5;
> int sum = 0;
> for (int i = 0; i < size; i++) {
> probe1 ^= probe1 << 13;
> sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; probe2 ^= probe2 << 13;
> probe1 ^= probe1 >>> 17; sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
> sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; probe2 ^= probe2 >>> 17;
> probe1 ^= probe1 << 5; sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
> sum += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */; probe2 ^= probe2 << 5;
> sum += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) /* <= 0 ? 1 : 0 */;
> }
> seed = probe2 + probe1;
> return sum;
> }
> ```
>
> Does that help?
Thanks for your comment.
Regarding your example, I made minor updates, shown as below:
diff --git a/test/micro/org/openjdk/bench/java/lang/Integers.java b/test/micro/org/openjdk/bench/java/lang/Integers.java
index 43ceb5d18d2..5ecbee26cab 100644
--- a/test/micro/org/openjdk/bench/java/lang/Integers.java
+++ b/test/micro/org/openjdk/bench/java/lang/Integers.java
@@ -167,6 +167,52 @@ public class Integers {
}
}
+ @Benchmark
+ public void compareUnsignedIndirect3(Blackhole bh) {
+ int seed = intsBig[0];
+ int probe1 = seed;
+ int probe2 = seed ^ seed << 5;
+ int r = 0;
+ for (int i = 0; i < size; i++) {
+ probe1 ^= probe1 << 13;
+ r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) <= 0 ? 1 : 0;
+ probe2 ^= probe2 << 13;
+ probe1 ^= probe1 >>> 17;
+ r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) <= 0 ? 1 : 0;
+ r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) <= 0 ? 1 : 0;
+ probe2 ^= probe2 >>> 17;
+ probe1 ^= probe1 << 5;
+ r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) <= 0 ? 1 : 0;
+ r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE) <= 0 ? 1 : 0;
+ probe2 ^= probe2 << 5;
+ r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE) <= 0 ? 1 : 0;
+ }
+ bh.consume(r);
+ }
+
+ @Benchmark
+ public void compareUnsignedDirect3(Blackhole bh) {
+ int seed = intsBig[0];
+ int probe1 = seed;
+ int probe2 = seed ^ seed << 5;
+ int r = 0;
+ for (int i = 0; i < size; i++) {
+ probe1 ^= probe1 << 13;
+ r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
+ probe2 ^= probe2 << 13;
+ probe1 ^= probe1 >>> 17;
+ r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
+ r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
+ probe2 ^= probe2 >>> 17;
+ probe1 ^= probe1 << 5;
+ r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
+ r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
+ probe2 ^= probe2 << 5;
+ r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
+ }
+ bh.consume(r);
+ }
+
@Benchmark
public void reverseBytes() {
for (int i = 0; i < size; i++) {
Here shows the performance data of `compareUnsignedDirect3` on aarch64.
Integer.MAX_VALUE
Before After Unit
compareUnsignedDirect3 2.986 ± 0.019 1.858 ± 0.008 us/op
We can see using this intrinsic can introduce some performance uplifts.
Regarding `Integer.compareUnsigned(XX, YY)`, we should use random values for both `XX` and `YY` in order to expose the advantage of branch predication.
Hence, I made the following two versions of updates.
**Version-2**: Using another constant value for `YY`, e.g., `-1`.
@@ -198,17 +198,17 @@ public class Integers {
int r = 0;
for (int i = 0; i < size; i++) {
probe1 ^= probe1 << 13;
- r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
+ r += Integer.compareUnsigned(probe1, -1);
probe2 ^= probe2 << 13;
probe1 ^= probe1 >>> 17;
- r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
- r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
+ r += Integer.compareUnsigned(probe2, -1);
+ r += Integer.compareUnsigned(probe1, -1);
probe2 ^= probe2 >>> 17;
probe1 ^= probe1 << 5;
- r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
- r += Integer.compareUnsigned(probe1, Integer.MAX_VALUE);
+ r += Integer.compareUnsigned(probe2, -1);
+ r += Integer.compareUnsigned(probe1, -1);
probe2 ^= probe2 << 5;
- r += Integer.compareUnsigned(probe2, Integer.MAX_VALUE);
+ r += Integer.compareUnsigned(probe2, -1);
}
bh.consume(r);
}
**Version-3**: Using some random value for `YY`, e.g., `probe1, probe2, r`.
@@ -198,17 +198,17 @@ public class Integers {
int r = 0;
for (int i = 0; i < size; i++) {
probe1 ^= probe1 << 13;
- r += Integer.compareUnsigned(probe1, -1);
+ r += Integer.compareUnsigned(probe1, r);
probe2 ^= probe2 << 13;
probe1 ^= probe1 >>> 17;
- r += Integer.compareUnsigned(probe2, -1);
- r += Integer.compareUnsigned(probe1, -1);
+ r += Integer.compareUnsigned(probe2, probe1);
+ r += Integer.compareUnsigned(probe1, r);
probe2 ^= probe2 >>> 17;
probe1 ^= probe1 << 5;
- r += Integer.compareUnsigned(probe2, -1);
- r += Integer.compareUnsigned(probe1, -1);
+ r += Integer.compareUnsigned(probe2, probe1);
+ r += Integer.compareUnsigned(probe1, r);
probe2 ^= probe2 << 5;
- r += Integer.compareUnsigned(probe2, -1);
+ r += Integer.compareUnsigned(probe2, probe1);
}
bh.consume(r);
}
Here shows all the performance data on aarch64.
Integer.MAX_VALUE (version-1)
--------------------------------------------------------------------
Before After Unit
compareUnsignedDirect2 2.986 ± 0.019 1.858 ± 0.008 us/op
-1 (version-2)
--------------------------------------------------------------------
Before After Unit
compareUnsignedDirect2 1.384 ± 0.002 1.858 ± 0.002 us/op
probe1,probe2,r (version-3)
--------------------------------------------------------------------
Before After Unit
compareUnsignedDirect2 2.488 ± 0.076 2.517 ± 0.013 us/op
Note that for **version-1 and version-2**:
1) **After** column: the data is nearly the same if using the intrinsic (), as our `cmp; cset; cneg` sequence is predictable.
2) **Before** column: the data differs a lot. IMO, the advantage of branch prediction is most utilized for **version-2**, since `-1` will be treated as the largest unsigned value, and most comparisons can be predicated.
Note that for **version-3** (i.e. both `XX` and `YY` are random values):
1) **Before** column: the data may vary a bit for different runs. Sometimes it ls slightly bigger than **After** column, that is, using the intrinsic can introduce performance uplift. Sometimes, it's slightly smaller than **After** column, that is, using the intrinsic can introduce performance regression.
-------------
PR: https://git.openjdk.org/jdk/pull/11383
More information about the hotspot-compiler-dev
mailing list