RFR: 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions [v6]

Wed Nov 12 04:16:06 UTC 2025

On Tue, 21 Oct 2025 12:17:46 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Jatin Bhateja has updated the pull request with a new target base due to a merge or a rebase. The pull request now contains eight commits:
>> 
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8351016
>>  - Limiting register biasing to NDD specific demotable instructions
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8351016
>>  - Fix jtreg, one less spill
>>  - Updating as per reivew suggestions
>>  - Merge branch 'master' of http://github.com/openjdk/jdk into JDK-8351016
>>  - Some refactoring
>>  - 8351016: RA support for EVEX to REX/REX2 demotion to optimize NDD instructions
>
> Current scheme of validation is manual:-
> 
> 1) Revert https://github.com/openjdk/jdk/pull/27320, since SDE 9.58 does not support APX_NCI_NDD_NF flag yet.
> 2) Static register allocation ordering change in x86_64.ad to always prefer EGPR R16-R31 during allocation.
> 3) Register allocation biasing facilitates demotion, which happens in the assembler layer.
> 4) Added debug messages in demotable assembler routines.  
> 5) Inspected the assembler encoding in Intel xed64
> 6) Ran the following tests with -XX:-UseSuperWord to exercise various NDD demotable instructions with Intel SDE 9.58.
>    - test/hotspot/jtreg/compiler/c2/cr6340864/TestIntVect.java
>    - test/hotspot/jtreg/compiler/c2/cr6340864/TestLongVect.java
> 
> **By limiting the scope of the fix to NDD-specific instructions, we have now mitigated any unwanted performance side effects on other backends OR non-APX x86 backends.**
> 
> We do have existing tests in place for functional correctness of NDD assembler instructions https://github.com/openjdk/jdk/blob/master/test/hotspot/gtest/x86/x86-asmtest.py

> Thanks for working on this @jatin-bhateja! I think the code changes themselves look sound, but I would like a bit more information about the performance and code size improvements. I'm also running some additional testing and benchmarking, and will let you know when I have the results.
> 
> > The patch shows around 5-20% improvement in code size by facilitating NDD demotion.
> 
> Can you elaborate on how you measured this improvement?
> 

Hi @dlunde, improvements are gauged by inspecting the JIT code size. Every NDD instruction requires a 4-byte extended EVEX prefix. By demoting it to a REX/REX2 prefix, we save 2-3 bytes per instruction. For example, consider the following micro kernel. With this patch, almost every NDD instruction benefits from register biasing, and thus the assembler layer demotes them to REX/REX2-prefixed instructions.

Kernel:-
--------
    public static long micro(long arg1, long [] arg2, long arg3, long arg4, int ctr) {
       long t1 = arg1 + arg2[ctr] + arg3 + arg4;
       long t2 = arg1 * arg2[ctr] * arg3 * arg4;
       long t3 = arg1 ^ arg2[ctr] ^ arg3 ^ arg4;
       long t4 = arg1 | arg2[ctr] | arg3 | arg4;
       long t5 = arg1 & arg2[ctr] & arg3 & arg4;
       return t1 + t2 + t3 + t4 + t5;
    }

OptoAssembly with patch:-
-----------------
028     eandq     R11, RSI, R10 # long ndd
02e     eimulq   R9, RSI, R10   # long ndd
034     eandq     R11, R11, RCX # long ndd
037     eimulq   R9, R9, RCX    # long ndd
03b     eandq     R11, R11, R8  # long ndd
03e     eimulq   R9, R9, R8     # long ndd
042     eaddq    RBX, RSI, R10  # long ndd
048     exorq    RDI, RSI, R10  # long ndd
04e     eaddq    RBX, RBX, RCX  # long ndd
051     exorq    RDI, RDI, RCX  # long ndd
054     eaddq    RBX, RBX, R8   # long ndd
057     exorq    RDI, RDI, R8   # long ndd
05a     eaddq    RBX, RBX, R9   # long ndd
05d     eorq     RSI, RSI, R10  # long ndd
060     eaddq    RDI, RDI, RBX  # long ndd
063     eorq     RSI, RSI, RCX  # long ndd
066     eorq     RSI, RSI, R8   # long ndd
069     eaddq    RSI, RSI, RDI  # long ndd
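
As a rough cross-check of the size claim, the per-instruction sizes can be read off the offsets in the listing above. The sketch below is illustrative only: the class and method names are made up for this example, and it assumes the listed instructions are laid out back to back, so that each size is the distance to the next offset (the size of the last instruction is therefore unknown).

```java
import java.util.Arrays;

class NddCodeSize {
    // Byte offsets copied from the OptoAssembly listing above (hex).
    static final int[] OFFSETS = {
        0x028, 0x02e, 0x034, 0x037, 0x03b, 0x03e, 0x042, 0x048, 0x04e,
        0x051, 0x054, 0x057, 0x05a, 0x05d, 0x060, 0x063, 0x066, 0x069
    };

    // Size of instruction i is the distance to the next listed offset.
    // Assumption: the listed instructions are contiguous in the code buffer.
    static int[] sizes(int[] offsets) {
        int[] out = new int[offsets.length - 1];
        for (int i = 0; i < out.length; i++) {
            out[i] = offsets[i + 1] - offsets[i];
        }
        return out;
    }

    // Number of instructions with exactly the given encoded size.
    static long count(int[] sizes, int wanted) {
        return Arrays.stream(sizes).filter(s -> s == wanted).count();
    }

    public static void main(String[] args) {
        int[] s = sizes(OFFSETS);
        System.out.println("sizes  = " + Arrays.toString(s));
        System.out.println("6-byte = " + count(s, 6));
        System.out.println("4-byte = " + count(s, 4));
        System.out.println("3-byte = " + count(s, 3));
    }
}
```

Under that assumption, the 6-byte entries are the instructions whose destination differs from both sources, while the 3-byte and 4-byte entries are those where the destination matches the first source, which is the case register biasing tries to create.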

Disassembly of JIT code:-
---------------------------
EMR>xed64 -64 -d 4803d94833f94903d84933f84903d9
4803D94833F94903D84933F84903D9
ICLASS:     ADD
CATEGORY:   BINARY
EXTENSION:  BASE
IFORM:      ADD_GPRv_GPRv_03
ISA_SET:    I86
ATTRIBUTES: SCALABLE
SHORT:      add rbx, rcx
4833F94903D84933F84903D9
ICLASS:     XOR
CATEGORY:   LOGICAL
EXTENSION:  BASE
IFORM:      XOR_GPRv_GPRv_33
ISA_SET:    I86
ATTRIBUTES: SCALABLE
SHORT:      xor rdi, rcx
4903D84933F84903D9
ICLASS:     ADD
CATEGORY:   BINARY
EXTENSION:  BASE
IFORM:      ADD_GPRv_GPRv_03
ISA_SET:    I86
ATTRIBUTES: SCALABLE
SHORT:      add rbx, r8
4933F84903D9
ICLASS:     XOR
CATEGORY:   LOGICAL
EXTENSION:  BASE
IFORM:      XOR_GPRv_GPRv_33
ISA_SET:    I86
ATTRIBUTES: SCALABLE
SHORT:      xor rdi, r8
4903D9
ICLASS:     ADD
CATEGORY:   BINARY
EXTENSION:  BASE
IFORM:      ADD_GPRv_GPRv_03
ISA_SET:    I86
ATTRIBUTES: SCALABLE
SHORT:      add rbx, r9
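
The xed64 output above can also be reproduced by hand for these five instructions, since each is just a REX prefix, a one-byte opcode and a register-direct ModRM byte. The following is a minimal sketch, not a general decoder: the class and method names are hypothetical, and it only handles REX.W-prefixed `03` (ADD r64, r/m64) and `33` (XOR r64, r/m64) with ModRM.mod = 11.

```java
import java.util.ArrayList;
import java.util.List;

class RexDecode {
    static final String[] REGS = {
        "rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
        "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"
    };

    // Decodes one 3-byte instruction: REX prefix, opcode, ModRM.
    static String decode(int rex, int opcode, int modrm) {
        if ((rex & 0xF0) != 0x40) throw new IllegalArgumentException("not a REX prefix");
        if ((rex & 0x08) == 0) throw new IllegalArgumentException("only REX.W forms handled");
        if ((modrm >> 6) != 3) throw new IllegalArgumentException("only register-direct ModRM handled");
        String mnemonic;
        if (opcode == 0x03) mnemonic = "add";
        else if (opcode == 0x33) mnemonic = "xor";
        else throw new IllegalArgumentException("opcode not handled in this sketch");
        // ModRM.reg (extended by REX.R) is the destination for these opcodes.
        int reg = ((modrm >> 3) & 7) | ((rex & 0x04) != 0 ? 8 : 0);
        // ModRM.rm (extended by REX.B) is the source.
        int rm = (modrm & 7) | ((rex & 0x01) != 0 ? 8 : 0);
        return mnemonic + " " + REGS[reg] + ", " + REGS[rm];
    }

    // Splits a hex string into 3-byte instructions and decodes each one.
    static List<String> decodeAll(String hex) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 6 <= hex.length(); i += 6) {
            int rex = Integer.parseInt(hex.substring(i, i + 2), 16);
            int opcode = Integer.parseInt(hex.substring(i + 2, i + 4), 16);
            int modrm = Integer.parseInt(hex.substring(i + 4, i + 6), 16);
            out.add(decode(rex, opcode, modrm));
        }
        return out;
    }

    public static void main(String[] args) {
        decodeAll("4803d94833f94903d84933f84903d9").forEach(System.out::println);
    }
}
```

Each decoded instruction carries a single REX byte (`48` or `49`) rather than a 4-byte EVEX prefix, which is where the 3-byte encodings come from.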

> > Thorough validations are underway using the latest Intel Software Development Emulator version 9.58.
> 
> Great, can you elaborate more on this? What types of validations?
> 
The current scheme of validation is mostly manual: running some tests under Intel SDE, inspecting OptoAssembly, disassembling JIT code, and adding debug messages [in the assembler](https://github.com/jatin-bhateja/external_staging/blob/main/Backup/reg_alloc_ndd_demotion_validation.diff). I have listed the validation configuration [above](https://github.com/openjdk/jdk/pull/26283#issuecomment-3426307551).

> Also, here is a patch with some simple style and wording fixes: [dlunde at d2b5118](https://github.com/dlunde/jdk/commit/d2b511804c757c89c5662028ea9e4a9dff43b641). I know you just moved some of the affected code around, but we might as well fix a few style issues while we are at it.

Thanks! I have modified some of the code, so these anomalies are taken care of.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/26283#issuecomment-3519857868