[C2] PEXT/PDEP intrinsics cause performance regression on AMD pre-Zen 3 CPUs

Galder Zamarreno galder at ibm.com
Fri Dec 12 15:38:11 UTC 2025


Hi Alessandro,

I've created https://bugs.openjdk.org/browse/JDK-8373613 to track this.
Would you like to send a PR with the proposed fix?

Our team has AMD servers that can help validate the suggested changes.

Thanks!

Alessandro Autiero <alautiero at gmail.com> writes:

> Hi,
>
> today I stumbled upon a performance issue with the Long.compress/expand and
> Integer.compress/expand intrinsics on certain AMD processors. I discovered
> this while working on an optimized varint decoder where I was hoping to use
> Long.compress() to speed up bit extraction. Instead, I found my "optimized"
> version was slower than my naive loop-based implementation. After some
> digging, I believe I understand what's happening.
>
> **Background**
>
> The compress and expand methods (added in JDK 19 via JDK-8283893 [1]) are
> intrinsified by C2 to use the BMI2 PEXT and PDEP instructions when the CPU
> reports BMI2 support.
> This works great on Intel Haswell+ and AMD Zen 3+, where these instructions
> execute in dedicated hardware with approximately 3-cycle latency.
> However, AMD processors from Excavator up to (but not including) Zen 3
> implement PEXT/PDEP via microcode emulation rather than native hardware.
> This is confirmed by AMD's Software Optimization Guide for Family 19h
> Processors [2], Section 2.10.2, which states that Zen 3 has native ALU
> support for these instructions.
> Wikipedia's page on x86 Bit Manipulation Instruction Sets [3] also
> documents this behavior:
>
>> AMD processors before Zen 3 that implement PDEP and PEXT do so in
>> microcode, with a latency of 18 cycles rather than (Zen 3) 3 cycles. As a
>> result it is often faster to use other instructions on these processors.
>
>
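
For readers unfamiliar with these operations, here is a minimal sketch of what PEXT (compress) and PDEP (expand) compute, written as a bit-at-a-time loop. It is illustrative only and is not the JDK's actual fallback code; the class and method names are made up for this example. The first two inputs are the example values from the Integer.compress/expand Javadoc, and the last one strips the continuation bits from the two-byte varint encoding of 300 (bytes 0xAC 0x02, loaded little-endian), which is the kind of bit extraction mentioned above.

```java
// Bit-at-a-time reference versions of the operations PEXT and PDEP perform.
// Illustrative only: hypothetical names, not the JDK's fallback implementation.
final class BitCompressSketch {

    // PEXT-like: gather the bits of value selected by mask into the low bits.
    static long compress(long value, long mask) {
        long result = 0L;
        int out = 0;                          // next free bit position in result
        while (mask != 0) {
            long lowest = mask & -mask;       // isolate the lowest set mask bit
            if ((value & lowest) != 0) {
                result |= 1L << out;
            }
            out++;
            mask &= mask - 1;                 // clear that mask bit
        }
        return result;
    }

    // PDEP-like: scatter the low bits of value to the positions set in mask.
    static long expand(long value, long mask) {
        long result = 0L;
        int in = 0;                           // next source bit position in value
        while (mask != 0) {
            long lowest = mask & -mask;
            if ((value & (1L << in)) != 0) {
                result |= lowest;
            }
            in++;
            mask &= mask - 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // prints "cabab"
        System.out.println(Long.toHexString(compress(0xCAFEBABEL, 0xFF00FFF0L)));
        // prints "ca00bab0"
        System.out.println(Long.toHexString(expand(0xCABABL, 0xFF00FFF0L)));
        // varint bytes 0xAC 0x02 with continuation bits removed: prints "300"
        System.out.println(compress(0x02ACL, 0x7F7FL));
    }
}
```

On JDK 19 or later, the same inputs passed to Long.compress and Long.expand should give the same results, which makes this a convenient cross-check.
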
> **Reproducer**
>
> Here is a JMH benchmark that demonstrates the issue by comparing the
> intrinsified path against the software fallback using ControlIntrinsic
> flags:
>
> ```
> import org.openjdk.jmh.annotations.*;
>
> import java.util.concurrent.ThreadLocalRandom;
> import java.util.concurrent.TimeUnit;
>
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @Warmup(iterations = 5, time = 1)
> @Measurement(iterations = 5, time = 1)
> @State(Scope.Benchmark)
> public class PextPdepPerformanceBug {
>     // I'm not using constants to prevent constant folding
>     private long longValue;
>     private long longMask;
>     private int intValue;
>     private int intMask;
>
>     @Setup(Level.Iteration)
>     public void setup() {
>         var rng = ThreadLocalRandom.current();
>         longValue = rng.nextLong();
>         longMask = rng.nextLong();
>         intValue = rng.nextInt();
>         intMask = rng.nextInt();
>     }
>
>     // Long.compress (PEXT 64-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_compress_l",
>         "-Xcomp"
>     })
>     public long compressLongSoftware() {
>         return Long.compress(longValue, longMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_compress_l",
>         "-Xcomp"
>     })
>     public long compressLongIntrinsic() {
>         return Long.compress(longValue, longMask);
>     }
>
>     // Long.expand (PDEP 64-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_expand_l",
>         "-Xcomp"
>     })
>     public long expandLongSoftware() {
>         return Long.expand(longValue, longMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_expand_l",
>         "-Xcomp"
>     })
>     public long expandLongIntrinsic() {
>         return Long.expand(longValue, longMask);
>     }
>
>     // Integer.compress (PEXT 32-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_compress_i",
>         "-Xcomp"
>     })
>     public int compressIntSoftware() {
>         return Integer.compress(intValue, intMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_compress_i",
>         "-Xcomp"
>     })
>     public int compressIntIntrinsic() {
>         return Integer.compress(intValue, intMask);
>     }
>
>     // Integer.expand (PDEP 32-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_expand_i",
>         "-Xcomp"
>     })
>     public int expandIntSoftware() {
>         return Integer.expand(intValue, intMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_expand_i",
>         "-Xcomp"
>     })
>     public int expandIntIntrinsic() {
>         return Integer.expand(intValue, intMask);
>     }
> }
> ```
>
> Here are the results on an i7 9700K, which supports the BMI2 instruction
> set and is not affected by this issue:
> ```
> Benchmark                                     Mode  Cnt   Score   Error  Units
> PextPdepPerformanceBug.compressIntIntrinsic   avgt   10   0,545 ± 0,002  ns/op
> PextPdepPerformanceBug.compressIntSoftware    avgt   10  11,357 ± 0,033  ns/op
> PextPdepPerformanceBug.compressLongIntrinsic  avgt   10   0,552 ± 0,012  ns/op
> PextPdepPerformanceBug.compressLongSoftware   avgt   10  16,197 ± 0,203  ns/op
> PextPdepPerformanceBug.expandIntIntrinsic     avgt   10   0,546 ± 0,006  ns/op
> PextPdepPerformanceBug.expandIntSoftware      avgt   10  12,179 ± 0,457  ns/op
> PextPdepPerformanceBug.expandLongIntrinsic    avgt   10   0,548 ± 0,018  ns/op
> PextPdepPerformanceBug.expandLongSoftware     avgt   10  17,658 ± 0,534  ns/op
> ```
>
> And here are the results on a Ryzen 7 2700, which also supports the BMI2
> instruction set but is affected by this issue:
> ```
> Benchmark                                     Mode  Cnt   Score    Error  Units
> PextPdepPerformanceBug.compressIntIntrinsic   avgt   10  28.010 ±  9.929  ns/op
> PextPdepPerformanceBug.compressIntSoftware    avgt   10  20.008 ±  2.129  ns/op
> PextPdepPerformanceBug.compressLongIntrinsic  avgt   10  48.999 ±  8.468  ns/op
> PextPdepPerformanceBug.compressLongSoftware   avgt   10  28.638 ±  5.336  ns/op
> PextPdepPerformanceBug.expandIntIntrinsic     avgt   10  24.860 ±  6.784  ns/op
> PextPdepPerformanceBug.expandIntSoftware      avgt   10  19.277 ±  1.719  ns/op
> PextPdepPerformanceBug.expandLongIntrinsic    avgt   10  43.889 ± 10.575  ns/op
> PextPdepPerformanceBug.expandLongSoftware     avgt   10  27.350 ±  1.898  ns/op
> ```
>
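
To make the two tables easier to compare, the small sketch below computes the intrinsic-to-software ratio for each benchmark, using the scores copied from the results above (the class and method names are hypothetical). A ratio above 1 means the intrinsic is slower than the software fallback. By this measure the Ryzen 7 2700 intrinsics come out roughly 1.3x to 1.7x slower, while on the i7 9700K the intrinsics are roughly 20x to 32x faster. The Ryzen error bars are wide (the compressInt intervals overlap, for instance), so the ratios should be read as indicative rather than exact.

```java
import java.util.Locale;

// Ratio of intrinsic time to software time, from the JMH scores reported above.
// Hypothetical helper for illustration; scores are copied from the two tables.
final class PextPdepRatios {

    // Returns intrinsic / software; a value above 1.0 means the intrinsic is slower.
    static double slowdown(double intrinsicNs, double softwareNs) {
        return intrinsicNs / softwareNs;
    }

    public static void main(String[] args) {
        String[] names = {"compressInt", "compressLong", "expandInt", "expandLong"};
        // Each row is {intrinsic ns/op, software ns/op}.
        double[][] intel = {
            {0.545, 11.357}, {0.552, 16.197}, {0.546, 12.179}, {0.548, 17.658}
        };
        double[][] ryzen = {
            {28.010, 20.008}, {48.999, 28.638}, {24.860, 19.277}, {43.889, 27.350}
        };
        for (int i = 0; i < names.length; i++) {
            System.out.println(String.format(Locale.ROOT,
                "%-12s i7 9700K: %.3f   Ryzen 7 2700: %.3f",
                names[i],
                slowdown(intel[i][0], intel[i][1]),
                slowdown(ryzen[i][0], ryzen[i][1])));
        }
    }
}
```
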
> **Precedent and Scope**
>
> A similar issue was reported in JDK-8334474 [4], where the compress/expand
> intrinsics were disabled on RISC-V because the vectorized implementation
> caused regressions compared to the pure-Java fallback.
> This led me to investigate whether other JDK intrinsics relying on BMI2
> instructions might be affected.
> The good news is that, as stated before, PEXT and PDEP are the only BMI2
> instructions that AMD implemented via microcode on pre-Zen 3 processors:
> the others execute efficiently on all BMI2-capable hardware.
> I also verified that no other JDK methods use PEXT/PDEP, so the four
> methods covered in this report (Long.compress, Long.expand,
> Integer.compress, Integer.expand) should be the only ones affected.
> It's worth double-checking this, though, as the JDK is very large and I
> could have missed other uses.
>
> **Mitigation**
>
> The intrinsic selection logic should check both BMI2 support and CPU
> vendor/family.
> Specifically, disable these intrinsics when the CPU vendor is AMD and the
> family is less than 0x19 (Zen 3).
> I think this could be implemented in x86.ad [5], alongside the existing
> BMI2 check, but I'm not familiar with C2's source code.
> Still, I would be happy to work on a fix myself if the issue is confirmed
> and that is acceptable.
>
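
The proposed check can be modelled as follows. This is only a sketch of the decision logic in Java, not HotSpot code: the real change would live in the x86 C++/AD sources, and all names here are hypothetical. It assumes the usual CPUID convention that the displayed family is the base family (leaf 1, EAX bits 11:8) plus the extended family (bits 27:20) when the base family is 0xF, and that AMD parts report the vendor string "AuthenticAMD". The EAX values in main are example signatures assumed for a Zen+ part and a Zen 3 part; treat them as illustrative.

```java
// Model of the proposed mitigation: keep the PEXT/PDEP intrinsics unless the
// CPU is an AMD part older than family 0x19 (Zen 3). Hypothetical names only.
final class PextPdepPolicySketch {

    static final int ZEN3_FAMILY = 0x19;

    // Displayed family from CPUID leaf 1 EAX: base family in bits 11:8,
    // extended family in bits 27:20, added only when the base family is 0xF.
    static int displayFamily(int cpuidEax) {
        int base = (cpuidEax >>> 8) & 0xF;
        int extended = (cpuidEax >>> 20) & 0xFF;
        return base == 0xF ? base + extended : base;
    }

    // True when the intrinsics should stay enabled.
    static boolean usePextPdepIntrinsics(boolean hasBmi2, String vendor, int family) {
        if (!hasBmi2) {
            return false;                     // no BMI2, so no PEXT/PDEP at all
        }
        boolean microcodedAmd = "AuthenticAMD".equals(vendor) && family < ZEN3_FAMILY;
        return !microcodedAmd;
    }

    public static void main(String[] args) {
        int zenPlus = displayFamily(0x00800F82);   // assumed Zen+ signature, family 0x17
        int zen3 = displayFamily(0x00A20F10);      // assumed Zen 3 signature, family 0x19
        System.out.println(usePextPdepIntrinsics(true, "AuthenticAMD", zenPlus)); // false
        System.out.println(usePextPdepIntrinsics(true, "AuthenticAMD", zen3));    // true
        System.out.println(usePextPdepIntrinsics(true, "GenuineIntel", 6));       // true
    }
}
```

Until such a check exists, the same effect can be had per process with the diagnostic flags already used in the benchmark above, for example -XX:+UnlockDiagnosticVMOptions together with -XX:ControlIntrinsic=-_compress_l.
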
> Thanks for reading!
>
> [1] https://bugs.openjdk.org/browse/JDK-8283893  
> [2] https://developer.amd.com/resources/developer-guides-manuals/  
> [3] https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set  
> [4] https://bugs.openjdk.org/browse/JDK-8334474  
> [5] https://github.com/jatin-bhateja/jdk/blob/7d35a283cf2497565d230e3d5426f563f7e5870d/src/hotspot/cpu/x86/x86.ad#L3183

-- 
Galder Zamarreño
Software Developer
IBM Software
galder at ibm.com

IBM


More information about the hotspot-compiler-dev mailing list