[C2] PEXT/PDEP intrinsics cause performance regression on AMD pre-Zen 3 CPUs
Galder Zamarreno
galder at ibm.com
Fri Dec 12 15:38:11 UTC 2025
Hi Alessandro,
I've created https://bugs.openjdk.org/browse/JDK-8373613 to track this.
Would you like to send a PR with the proposed fix?
Our team has AMD servers that can help validate the suggested changes.
Thanks!
Alessandro Autiero <alautiero at gmail.com> writes:
> Hi,
>
> today I stumbled upon a performance issue with the Long.compress/expand and
> Integer.compress/expand intrinsics on certain AMD processors. I discovered
> this while working on an optimized varint decoder where I was hoping to use
> Long.compress() to speed up bit extraction. Instead, I found my "optimized"
> version was slower than my naive loop-based implementation. After some
> digging, I believe I understand what's happening.
>
> **Background**
>
> The compress and expand methods (added in JDK 19 via JDK-8283893 [1]) are
> intrinsified by C2 to use the BMI2 PEXT and PDEP instructions when the CPU
> reports BMI2 support.
> This works great on Intel Haswell+ and AMD Zen 3+, where these instructions
> execute in dedicated hardware with approximately 3-cycle latency.
> However, AMD processors from Excavator up to (but not including) Zen 3
> implement PEXT/PDEP via microcode emulation rather than native hardware.
> This is confirmed by AMD's Software Optimization Guide for Family 19h
> Processors [2], Section 2.10.2, which states that Zen 3 has native ALU
> support for these instructions.
> Wikipedia's page on x86 Bit Manipulation Instruction Sets [3] also
> documents this behavior:
>
>> AMD processors before Zen 3 that implement PDEP and PEXT do so in
>> microcode, with a latency of 18 cycles rather than (Zen 3) 3 cycles. As a
>> result it is often faster to use other instructions on these processors.
>
>
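For readers of the thread who haven't used these methods: a minimal sketch of what compress and expand compute. The `referenceCompress` and `referenceExpand` helpers are hypothetical names for a bit-by-bit model of the semantics; they are not the JDK's actual fallback code. The hex values in `main` are the examples given in the `Integer.compress` and `Integer.expand` Javadoc. It needs JDK 19 or later.

```java
// Illustrative only: a loop-based model of the semantics, not the JDK's fallback.
final class CompressExpandDemo {

    // Gathers the bits of value selected by mask into the low-order bits of the result.
    static long referenceCompress(long value, long mask) {
        long result = 0;
        int outBit = 0;
        for (int bit = 0; bit < Long.SIZE; bit++) {
            if (((mask >>> bit) & 1L) != 0) {
                result |= ((value >>> bit) & 1L) << outBit;
                outBit++;
            }
        }
        return result;
    }

    // Scatters the low-order bits of value to the positions selected by mask.
    static long referenceExpand(long value, long mask) {
        long result = 0;
        int inBit = 0;
        for (int bit = 0; bit < Long.SIZE; bit++) {
            if (((mask >>> bit) & 1L) != 0) {
                result |= ((value >>> inBit) & 1L) << bit;
                inBit++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Examples from the Integer.compress / Integer.expand Javadoc.
        System.out.println(Integer.toHexString(Integer.compress(0xCAFEBABE, 0xFF00FFF0)));
        System.out.println(Integer.toHexString(Integer.expand(0x000CABAB, 0xFF00FFF0)));

        // The loop-based model should agree with the library methods.
        long value = 0x0123456789ABCDEFL;
        long mask = 0x00FF00FF00FF00FFL;
        System.out.println(referenceCompress(value, mask) == Long.compress(value, mask));
        System.out.println(referenceExpand(value, mask) == Long.expand(value, mask));
    }
}
```

On hardware with fast PEXT/PDEP, each library call can map to a single instruction, which is why the intrinsic normally wins by a wide margin.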
> **Reproducer**
>
> Here is a JMH benchmark that demonstrates the issue by comparing the
> intrinsified path against the software fallback using ControlIntrinsic
> flags:
>
> ```
> import org.openjdk.jmh.annotations.*;
>
> import java.util.concurrent.ThreadLocalRandom;
> import java.util.concurrent.TimeUnit;
>
> @BenchmarkMode(Mode.AverageTime)
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @Warmup(iterations = 5, time = 1)
> @Measurement(iterations = 5, time = 1)
> @State(Scope.Benchmark)
> public class PextPdepPerformanceBug {
>     // I'm not using constants to prevent constant folding
>     private long longValue;
>     private long longMask;
>     private int intValue;
>     private int intMask;
>
>     @Setup(Level.Iteration)
>     public void setup() {
>         var rng = ThreadLocalRandom.current();
>         longValue = rng.nextLong();
>         longMask = rng.nextLong();
>         intValue = rng.nextInt();
>         intMask = rng.nextInt();
>     }
>
>     // Long.compress (PEXT 64-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_compress_l",
>         "-Xcomp"
>     })
>     public long compressLongSoftware() {
>         return Long.compress(longValue, longMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_compress_l",
>         "-Xcomp"
>     })
>     public long compressLongIntrinsic() {
>         return Long.compress(longValue, longMask);
>     }
>
>     // Long.expand (PDEP 64-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_expand_l",
>         "-Xcomp"
>     })
>     public long expandLongSoftware() {
>         return Long.expand(longValue, longMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_expand_l",
>         "-Xcomp"
>     })
>     public long expandLongIntrinsic() {
>         return Long.expand(longValue, longMask);
>     }
>
>     // Integer.compress (PEXT 32-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_compress_i",
>         "-Xcomp"
>     })
>     public int compressIntSoftware() {
>         return Integer.compress(intValue, intMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_compress_i",
>         "-Xcomp"
>     })
>     public int compressIntIntrinsic() {
>         return Integer.compress(intValue, intMask);
>     }
>
>     // Integer.expand (PDEP 32-bit)
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=-_expand_i",
>         "-Xcomp"
>     })
>     public int expandIntSoftware() {
>         return Integer.expand(intValue, intMask);
>     }
>
>     @Benchmark
>     @Fork(value = 2, jvmArgsAppend = {
>         "-XX:+UnlockDiagnosticVMOptions",
>         "-XX:ControlIntrinsic=+_expand_i",
>         "-Xcomp"
>     })
>     public int expandIntIntrinsic() {
>         return Integer.expand(intValue, intMask);
>     }
> }
> ```
>
> Here are the results on an i7 9700K, which supports the BMI2 instruction
> set and is not affected by this issue:
> ```
> Benchmark                                     Mode  Cnt   Score   Error  Units
> PextPdepPerformanceBug.compressIntIntrinsic   avgt   10   0,545 ± 0,002  ns/op
> PextPdepPerformanceBug.compressIntSoftware    avgt   10  11,357 ± 0,033  ns/op
> PextPdepPerformanceBug.compressLongIntrinsic  avgt   10   0,552 ± 0,012  ns/op
> PextPdepPerformanceBug.compressLongSoftware   avgt   10  16,197 ± 0,203  ns/op
> PextPdepPerformanceBug.expandIntIntrinsic     avgt   10   0,546 ± 0,006  ns/op
> PextPdepPerformanceBug.expandIntSoftware      avgt   10  12,179 ± 0,457  ns/op
> PextPdepPerformanceBug.expandLongIntrinsic    avgt   10   0,548 ± 0,018  ns/op
> PextPdepPerformanceBug.expandLongSoftware     avgt   10  17,658 ± 0,534  ns/op
> ```
>
> And here are the results on a Ryzen 7 2700, which also supports the BMI2
> instruction set but is affected by this issue:
> ```
> Benchmark                                     Mode  Cnt   Score    Error  Units
> PextPdepPerformanceBug.compressIntIntrinsic   avgt   10  28.010 ±  9.929  ns/op
> PextPdepPerformanceBug.compressIntSoftware    avgt   10  20.008 ±  2.129  ns/op
> PextPdepPerformanceBug.compressLongIntrinsic  avgt   10  48.999 ±  8.468  ns/op
> PextPdepPerformanceBug.compressLongSoftware   avgt   10  28.638 ±  5.336  ns/op
> PextPdepPerformanceBug.expandIntIntrinsic     avgt   10  24.860 ±  6.784  ns/op
> PextPdepPerformanceBug.expandIntSoftware      avgt   10  19.277 ±  1.719  ns/op
> PextPdepPerformanceBug.expandLongIntrinsic    avgt   10  43.889 ± 10.575  ns/op
> PextPdepPerformanceBug.expandLongSoftware     avgt   10  27.350 ±  1.898  ns/op
> ```
>
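To put the Ryzen 7 2700 numbers in relative terms, here is a small sketch that divides each intrinsic score by the matching software score from the table above. The class and method names are hypothetical. Given the wide error margins on the intrinsic runs, the ratios should be read as indicative only.

```java
// Illustrative arithmetic on the Ryzen 7 2700 scores quoted above (ns/op).
final class SlowdownRatios {

    // A ratio above 1.0 means the intrinsic path was slower than the software path.
    static double slowdown(double intrinsicNs, double softwareNs) {
        return intrinsicNs / softwareNs;
    }

    public static void main(String[] args) {
        System.out.printf("compressInt  %.2fx%n", slowdown(28.010, 20.008));
        System.out.printf("compressLong %.2fx%n", slowdown(48.999, 28.638));
        System.out.printf("expandInt    %.2fx%n", slowdown(24.860, 19.277));
        System.out.printf("expandLong   %.2fx%n", slowdown(43.889, 27.350));
    }
}
```

All four ratios come out above 1.0 on this machine, with the 64-bit variants showing the larger gap.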
> **Precedent and Scope**
>
> A similar issue was reported in JDK-8334474 [4], where the compress/expand
> intrinsics were disabled on RISC-V because the vectorized implementation
> caused regressions compared to the pure-Java fallback.
> This led me to investigate whether other JDK intrinsics relying on BMI2
> instructions might be affected.
> The good news is that, as stated before, PEXT and PDEP are the only BMI2
> instructions that AMD implemented via microcode on pre-Zen 3 processors:
> the others execute efficiently on all BMI2-capable hardware.
> I also verified that no other JDK methods use PEXT/PDEP, so the four
> methods covered in this report (Long.compress, Long.expand,
> Integer.compress, Integer.expand) should be the only ones affected.
> It's worth verifying this though as the JDK is very large and I could have
> missed such examples.
>
> **Mitigation**
>
> The intrinsic selection logic should check both BMI2 support and CPU
> vendor/family.
> Specifically, disable these intrinsics when the CPU vendor is AMD and the
> family is less than 0x19 (Zen 3).
> I think this could be implemented in x86.ad [5], alongside the existing
> BMI2 check, but I'm not familiar with C2's source code.
> Still, I would be happy to work on a fix myself once the issue is
> confirmed, if that is acceptable.
>
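As a way to pin down the proposed condition before anyone touches HotSpot, here is a rough model of the decision in Java. This is only a sketch: the real check would live in HotSpot's C++ sources, and `PextPdepPolicy`, `useBmi2CompressExpand`, and its parameters are hypothetical names, not existing HotSpot APIs. It assumes the vendor is identified by the CPUID vendor string and that family 0x19 marks Zen 3, as stated in the quoted proposal.

```java
// Hypothetical model of the proposed check; not HotSpot code.
final class PextPdepPolicy {

    // AMD family 19h corresponds to Zen 3, per the proposal above.
    static final int AMD_FAMILY_ZEN3 = 0x19;

    // Returns true when the PEXT/PDEP intrinsics should be enabled.
    // vendor is the CPUID vendor string, e.g. "AuthenticAMD" or "GenuineIntel".
    static boolean useBmi2CompressExpand(boolean hasBmi2, String vendor, int family) {
        if (!hasBmi2) {
            return false; // no BMI2 at all: nothing to intrinsify with
        }
        boolean isAmd = "AuthenticAMD".equals(vendor);
        // Pre-Zen 3 AMD parts run PEXT/PDEP in microcode, so prefer the fallback there.
        return !(isAmd && family < AMD_FAMILY_ZEN3);
    }

    public static void main(String[] args) {
        // Ryzen 7 2700 (Zen+, family 0x17): intrinsic disabled.
        System.out.println(useBmi2CompressExpand(true, "AuthenticAMD", 0x17));
        // Zen 3 (family 0x19): intrinsic enabled.
        System.out.println(useBmi2CompressExpand(true, "AuthenticAMD", 0x19));
        // Intel with BMI2: unaffected by the check.
        System.out.println(useBmi2CompressExpand(true, "GenuineIntel", 0x6));
    }
}
```

Whether other vendors shipping Zen-derived cores need the same treatment is something the PR review would have to settle.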
> Thanks for reading!
>
> [1] https://bugs.openjdk.org/browse/JDK-8283893
> [2] https://developer.amd.com/resources/developer-guides-manuals/
> [3] https://en.wikipedia.org/wiki/X86_Bit_manipulation_instruction_set
> [4] https://bugs.openjdk.org/browse/JDK-8334474
> [5] https://github.com/jatin-bhateja/jdk/blob/7d35a283cf2497565d230e3d5426f563f7e5870d/src/hotspot/cpu/x86/x86.ad#L3183
--
Galder Zamarreño
Software Developer
IBM Software
galder at ibm.com
IBM
More information about the hotspot-compiler-dev
mailing list