[C2] PEXT/PDEP intrinsics cause performance regression on AMD pre-Zen 3 CPUs
Alessandro Autiero
alautiero at gmail.com
Sun Nov 30 16:50:44 UTC 2025
Hi,
today I stumbled upon a performance issue with the Long.compress/expand and
Integer.compress/expand intrinsics on certain AMD processors. I discovered
this while working on an optimized varint decoder where I was hoping to use
Long.compress() to speed up bit extraction. Instead, I found my "optimized"
version was slower than my naive loop-based implementation. After some
digging, I believe I understand what's happening.
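For context, the kind of loop-based compress I was comparing against looks roughly like this (a simplified sketch, not my exact decoder):
```
// Simplified loop-based equivalent of Long.compress(value, mask):
// gathers the bits of value selected by mask into the low-order bits
// of the result, in order from least significant to most significant.
static long compressLoop(long value, long mask) {
    long result = 0;
    int outBit = 0;
    for (; mask != 0; mask &= mask - 1) { // iterate over the set bits of mask
        long lowestBit = mask & -mask;    // lowest set bit of mask
        if ((value & lowestBit) != 0) {
            result |= 1L << outBit;
        }
        outBit++;
    }
    return result;
}
```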
**Background**
The compress and expand methods (added in JDK 19 via JDK-8283893 [1]) are
intrinsified by C2 to use the BMI2 PEXT and PDEP instructions when the CPU
reports BMI2 support.
This works great on Intel Haswell+ and AMD Zen 3+, where these instructions
execute in dedicated hardware with approximately 3-cycle latency.
However, AMD processors from Excavator up to (but not including) Zen 3 implement
PEXT/PDEP in microcode rather than in dedicated hardware.
This is confirmed by AMD's Software Optimization Guide for Family 19h
Processors [2], Section 2.10.2, which states that Zen 3 has native ALU
support for these instructions.
Wikipedia's page on x86 Bit Manipulation Instruction Sets [3] also
documents this behavior:
> AMD processors before Zen 3 that implement PDEP and PEXT do so in
> microcode, with a latency of 18 cycles rather than (Zen 3) 3 cycles. As a
> result it is often faster to use other instructions on these processors.
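To make concrete what these instructions compute, here is a small hand-worked example of the Long.compress/Long.expand semantics (values chosen purely for illustration):
```
long mask = 0x00FF_00FF_00FF_00FFL;

// Long.compress (PEXT): gathers the bits of the first argument selected
// by the mask into the low-order bits of the result.
long packed = Long.compress(0x1122_3344_5566_7788L, mask); // 0x0000_0000_2244_6688L

// Long.expand (PDEP): scatters the low-order bits of the first argument
// back to the bit positions selected by the mask.
long spread = Long.expand(packed, mask);                   // 0x0022_0044_0066_0088L
```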
**Reproducer**
Here is a JMH benchmark that demonstrates the issue by comparing the
intrinsified path against the software fallback using ControlIntrinsic
flags:
```
import org.openjdk.jmh.annotations.*;

import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1)
@Measurement(iterations = 5, time = 1)
@State(Scope.Benchmark)
public class PextPdepPerformanceBug {
    // I'm not using constants to prevent constant folding
    private long longValue;
    private long longMask;
    private int intValue;
    private int intMask;

    @Setup(Level.Iteration)
    public void setup() {
        var rng = ThreadLocalRandom.current();
        longValue = rng.nextLong();
        longMask = rng.nextLong();
        intValue = rng.nextInt();
        intMask = rng.nextInt();
    }

    // Long.compress (PEXT 64-bit)
    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=-_compress_l",
            "-Xcomp"
    })
    public long compressLongSoftware() {
        return Long.compress(longValue, longMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=+_compress_l",
            "-Xcomp"
    })
    public long compressLongIntrinsic() {
        return Long.compress(longValue, longMask);
    }

    // Long.expand (PDEP 64-bit)
    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=-_expand_l",
            "-Xcomp"
    })
    public long expandLongSoftware() {
        return Long.expand(longValue, longMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=+_expand_l",
            "-Xcomp"
    })
    public long expandLongIntrinsic() {
        return Long.expand(longValue, longMask);
    }

    // Integer.compress (PEXT 32-bit)
    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=-_compress_i",
            "-Xcomp"
    })
    public int compressIntSoftware() {
        return Integer.compress(intValue, intMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=+_compress_i",
            "-Xcomp"
    })
    public int compressIntIntrinsic() {
        return Integer.compress(intValue, intMask);
    }

    // Integer.expand (PDEP 32-bit)
    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=-_expand_i",
            "-Xcomp"
    })
    public int expandIntSoftware() {
        return Integer.expand(intValue, intMask);
    }

    @Benchmark
    @Fork(value = 2, jvmArgsAppend = {
            "-XX:+UnlockDiagnosticVMOptions",
            "-XX:ControlIntrinsic=+_expand_i",
            "-Xcomp"
    })
    public int expandIntIntrinsic() {
        return Integer.expand(intValue, intMask);
    }
}
```
Here are the results on an Intel Core i7-9700K, which supports the BMI2
instruction set and is not affected by this issue:
```
Benchmark                                     Mode  Cnt   Score   Error  Units
PextPdepPerformanceBug.compressIntIntrinsic   avgt   10   0.545 ± 0.002  ns/op
PextPdepPerformanceBug.compressIntSoftware    avgt   10  11.357 ± 0.033  ns/op
PextPdepPerformanceBug.compressLongIntrinsic  avgt   10   0.552 ± 0.012  ns/op
PextPdepPerformanceBug.compressLongSoftware   avgt   10  16.197 ± 0.203  ns/op
PextPdepPerformanceBug.expandIntIntrinsic     avgt   10   0.546 ± 0.006  ns/op
PextPdepPerformanceBug.expandIntSoftware      avgt   10  12.179 ± 0.457  ns/op
PextPdepPerformanceBug.expandLongIntrinsic    avgt   10   0.548 ± 0.018  ns/op
PextPdepPerformanceBug.expandLongSoftware     avgt   10  17.658 ± 0.534  ns/op
```
And here are the results on an AMD Ryzen 7 2700 (Zen+), which also supports the
BMI2 instruction set but is affected by this issue:
```
Benchmark                                     Mode  Cnt   Score    Error  Units
PextPdepPerformanceBug.compressIntIntrinsic   avgt   10  28.010 ±  9.929  ns/op
PextPdepPerformanceBug.compressIntSoftware    avgt   10  20.008 ±  2.129  ns/op
PextPdepPerformanceBug.compressLongIntrinsic  avgt   10  48.999 ±  8.468  ns/op
PextPdepPerformanceBug.compressLongSoftware   avgt   10  28.638 ±  5.336  ns/op
PextPdepPerformanceBug.expandIntIntrinsic     avgt   10  24.860 ±  6.784  ns/op
PextPdepPerformanceBug.expandIntSoftware      avgt   10  19.277 ±  1.719  ns/op
PextPdepPerformanceBug.expandLongIntrinsic    avgt   10  43.889 ± 10.575  ns/op
PextPdepPerformanceBug.expandLongSoftware     avgt   10  27.350 ±  1.898  ns/op
```
**Precedent and Scope**
A similar issue was reported in JDK-8334474 [4], where the compress/expand
intrinsics were disabled on RISC-V because the vectorized implementation
caused regressions compared to the pure-Java fallback.
This led me to investigate whether other JDK intrinsics relying on BMI2
instructions might be affected.
The good news is that PEXT and PDEP are the only BMI2 instructions that AMD
implemented in microcode on pre-Zen 3 processors; the others execute
efficiently on all BMI2-capable hardware.
I also verified that no other JDK methods use PEXT/PDEP, so the four methods
covered in this report (Long.compress, Long.expand, Integer.compress,
Integer.expand) should be the only ones affected.
That said, it's worth double-checking, as the JDK is very large and I could
have missed something.
**Mitigation**
The intrinsic selection logic should check both BMI2 support and CPU
vendor/family.
Specifically, disable these intrinsics when the CPU vendor is AMD and the
family is less than 0x19 (Zen 3).
I think this could be implemented in x86.ad [5], alongside the existing
BMI2 check, but I'm not familiar with C2's source code.
Still, I would be happy to work on this myself if the issue is confirmed and
it's acceptable for me to pick it up.
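Until something like that lands, a possible stopgap for affected deployments is to disable just these four intrinsics with the same diagnostic flag used in the benchmark above, for example:
```
# Disables only the compress/expand intrinsics; all other intrinsics stay enabled.
java -XX:+UnlockDiagnosticVMOptions \
     -XX:ControlIntrinsic=-_compress_i,-_compress_l,-_expand_i,-_expand_l \
     ...
```
Since ControlIntrinsic requires unlocking diagnostic options, this isn't a long-term fix, but it avoids the regression without giving up the other BMI2-based intrinsics.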
Thanks for reading!
[1] https://bugs.openjdk.org/browse/JDK-8283893
[2] https://developer.amd.com/resources/developer-guides-manuals/
[3] https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
[4] https://bugs.openjdk.org/browse/JDK-8334474
[5]
https://github.com/jatin-bhateja/jdk/blob/7d35a283cf2497565d230e3d5426f563f7e5870d/src/hotspot/cpu/x86/x86.ad#L3183