Question regarding the (x86) assembler's use of NOP

Sun Sep 12 23:10:29 PDT 2010

David,

LEA-based NOPs have a register dependency. '0x0F 0x1F' is the fast multi-byte NOP; see:

Intel® 64 and IA-32 Architectures Software Developer's Manual
Volume 2B: Instruction Set Reference, N-Z

Vladimir

On 9/12/10 9:50 PM, David Dabbs wrote:
> Hi.
>
> I've been trying to trace PrintAssembly output back to the HS assembler.
> I noticed that the NOP generation code differs from the Intel
> recommendations and was wondering if someone could comment on the
> discrepancies.
>
>
> Thanks,
>
> David
>
>
>
> The Intel Arch Optimization Guide recommends the following regarding NOPs:
>
> 3.5.1.8 Using NOPs
> Code generators generate a no-operation (NOP) to align instructions.
> Examples of NOPs of different lengths in 32-bit mode are shown below:
>
>    1-byte: XCHG EAX, EAX
>    2-byte: 66 NOP
>    3-byte: LEA REG, 0 (REG) (8-bit displacement)
>    4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
>    5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
>    6-byte: LEA REG, 0 (REG) (32-bit displacement)
>    7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
>    8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
>    9-byte: NOP WORD  PTR [EAX + EAX*1 + 0] (32-bit displacement)
>
> These are all true NOPs, having no effect on the state of the machine
> except to advance the EIP. Because NOPs require hardware resources to
> decode and execute, use the fewest number to achieve the desired padding.
>
> The one byte NOP:[XCHG EAX,EAX] has special hardware support. Although it
> still consumes a μop and its accompanying resources, the dependence upon
> the old value of EAX is removed. This μop can be executed at the earliest
> possible opportunity, reducing the number of outstanding instructions and
> is the lowest cost NOP.
>
> The other NOPs have no special hardware support. Their input and output
> registers are interpreted by the hardware. Therefore, a code generator
> should arrange to use the register containing the oldest value as input,
> so that the NOP will dispatch and release RS resources at the earliest
> possible opportunity.
>
> Try to observe the following NOP generation priority. Select:
> * the smallest number of NOPs and pseudo-NOPs to provide the desired
>    padding.
> * NOPs that are least likely to execute on slower execution unit clusters.
> * the register arguments of NOPs to reduce dependencies.
>
> // end Intel Arch ---------------------------
>
>
> The code in assembler_x86.cpp however issues NOPs using:
>
> void Assembler::nop(int i) {
> #ifdef ASSERT
>    assert(i > 0, " ");
>    // The fancy nops aren't currently recognized by debuggers making it a
>    // pain to disassemble code while debugging. If asserts are on clearly
>    // speed is not an issue so simply use the single byte traditional nop
>    // to do alignment.
>
>    for (; i > 0; i--) emit_byte(0x90);
>    return;
>
> #endif // ASSERT
>
>    if (UseAddressNop && VM_Version::is_intel()) {
>      //
>      // Using multi-bytes nops "0x0F 0x1F [address]" for Intel
>      //  1: 0x90
>      //  2: 0x66 0x90
>      //  3: 0x66 0x66 0x90 (don't use "0x0F 0x1F 0x00" - need patching safe padding)
>      //  4: 0x0F 0x1F 0x40 0x00
>      //  5: 0x0F 0x1F 0x44 0x00 0x00
>      //  6: 0x66 0x0F 0x1F 0x44 0x00 0x00
>      //  7: 0x0F 0x1F 0x80 0x00 0x00 0x00 0x00
>      //  8: 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00
>      //  9: 0x66 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00
>      // 10: 0x66 0x66 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00
>      // 11: 0x66 0x66 0x66 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00
>
>      // The rest coding is Intel specific - don't use consecutive address nops
>
>      // 12: 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00 0x66 0x66 0x66 0x90
>      // 13: 0x66 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00 0x66 0x66 0x66 0x90
>      // 14: 0x66 0x66 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00 0x66 0x66 0x66 0x90
>      // 15: 0x66 0x66 0x66 0x0F 0x1F 0x84 0x00 0x00 0x00 0x00 0x00 0x66 0x66 0x66 0x90
>
>      while (i >= 15) {
>        // For Intel don't generate consecutive address nops (mix with regular nops)
>        i -= 15;
>        emit_byte(0x66);   // size prefix
>        emit_byte(0x66);   // size prefix
>        emit_byte(0x66);   // size prefix
>        addr_nop_8();
>        emit_byte(0x66);   // size prefix
>        emit_byte(0x66);   // size prefix
>        emit_byte(0x66);   // size prefix
>        emit_byte(0x90);   // nop
>      }
>      switch (i) {
>        case 14:
>          emit_byte(0x66); // size prefix
>        case 13:
>          emit_byte(0x66); // size prefix
>        case 12:
>          addr_nop_8();
>          emit_byte(0x66); // size prefix
>          emit_byte(0x66); // size prefix
>          emit_byte(0x66); // size prefix
>          emit_byte(0x90); // nop
>          break;
>        case 11:
>          emit_byte(0x66); // size prefix
>        case 10:
>          emit_byte(0x66); // size prefix
>        case 9:
>          emit_byte(0x66); // size prefix
>        case 8:
>          addr_nop_8();
>          break;
>        case 7:
>          addr_nop_7();
>          break;
>        case 6:
>          emit_byte(0x66); // size prefix
>        case 5:
>          addr_nop_5();
>          break;
>        case 4:
>          addr_nop_4();
>          break;
>        case 3:
>          // Don't use "0x0F 0x1F 0x00" - need patching safe padding
>          emit_byte(0x66); // size prefix
>        case 2:
>          emit_byte(0x66); // size prefix
>        case 1:
>          emit_byte(0x90); // nop
>          break;
>        default:
>          assert(i == 0, " ");
>      }
>      return;
>    }
>
>
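
As a cross-check on the length table in the comment above, here is a minimal standalone sketch that builds the same byte sequences into a plain vector. It is not HotSpot code: `Bytes`, `emit`, `addr_nop` and `intel_nop_padding` are hypothetical names made up for illustration, and the sketch assumes the table in the quoted comment is the intended output for each padding length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the assembler's code buffer (not HotSpot code).
using Bytes = std::vector<std::uint8_t>;

// Append `count` copies of `value`.
static void emit(Bytes& out, std::uint8_t value, int count = 1) {
    for (int k = 0; k < count; ++k) out.push_back(value);
}

// Append the 0x0F 0x1F address NOP of length 4, 5, 7 or 8 bytes.
static void addr_nop(Bytes& out, int len) {
    assert(len == 4 || len == 5 || len == 7 || len == 8);
    emit(out, 0x0F);
    emit(out, 0x1F);
    switch (len) {
        case 4: emit(out, 0x40); emit(out, 0x00);                     break;
        case 5: emit(out, 0x44); emit(out, 0x00); emit(out, 0x00);    break;
        case 7: emit(out, 0x80); emit(out, 0x00, 4);                  break;
        case 8: emit(out, 0x84); emit(out, 0x00); emit(out, 0x00, 4); break;
    }
}

// Build n bytes of padding following the table in the quoted comment.
Bytes intel_nop_padding(int n) {
    assert(n >= 0);
    Bytes out;
    while (n >= 15) {            // 15-byte unit: address nop, then regular nop
        emit(out, 0x66, 3);
        addr_nop(out, 8);
        emit(out, 0x66, 3);
        emit(out, 0x90);
        n -= 15;
    }
    if (n >= 12) {               // 12..14: prefixed 8-byte nop + 4-byte regular nop
        emit(out, 0x66, n - 12);
        addr_nop(out, 8);
        emit(out, 0x66, 3);
        emit(out, 0x90);
    } else if (n >= 8) {         // 8..11: prefixed 8-byte address nop
        emit(out, 0x66, n - 8);
        addr_nop(out, 8);
    } else if (n == 7) {
        addr_nop(out, 7);
    } else if (n >= 5) {         // 5..6: prefixed 5-byte address nop
        emit(out, 0x66, n - 5);
        addr_nop(out, 5);
    } else if (n == 4) {
        addr_nop(out, 4);
    } else if (n >= 1) {         // 1..3: prefixed 0x90, never "0x0F 0x1F 0x00"
        emit(out, 0x66, n - 1);
        emit(out, 0x90);
    }
    return out;
}
```

Calling `intel_nop_padding(n)` for any non-negative `n` should return exactly `n` bytes, which is an easy way to confirm that the fall-through cases in the switch add up as the comment claims.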
> void Assembler::addr_nop_4() {
>    // 4 bytes: NOP DWORD PTR [EAX+0]
>    emit_byte(0x0F);
>    emit_byte(0x1F);
>    emit_byte(0x40); // emit_rm(cbuf, 0x1, EAX_enc, EAX_enc);
>    emit_byte(0);    // 8-bits offset (1 byte)
> }
>
> void Assembler::addr_nop_5() {
>    // 5 bytes: NOP DWORD PTR [EAX+EAX*0+0] 8-bits offset
>    emit_byte(0x0F);
>    emit_byte(0x1F);
>    emit_byte(0x44); // emit_rm(cbuf, 0x1, EAX_enc, 0x4);
>    emit_byte(0x00); // emit_rm(cbuf, 0x0, EAX_enc, EAX_enc);
>    emit_byte(0);    // 8-bits offset (1 byte)
> }
>
> void Assembler::addr_nop_7() {
>    // 7 bytes: NOP DWORD PTR [EAX+0] 32-bits offset
>    emit_byte(0x0F);
>    emit_byte(0x1F);
>    emit_byte(0x80); // emit_rm(cbuf, 0x2, EAX_enc, EAX_enc);
>    emit_long(0);    // 32-bits offset (4 bytes)
> }
>
> void Assembler::addr_nop_8() {
>    // 8 bytes: NOP DWORD PTR [EAX+EAX*0+0] 32-bits offset
>    emit_byte(0x0F);
>    emit_byte(0x1F);
>    emit_byte(0x84); // emit_rm(cbuf, 0x2, EAX_enc, 0x4);
>    emit_byte(0x00); // emit_rm(cbuf, 0x0, EAX_enc, EAX_enc);
>    emit_long(0);    // 32-bits offset (4 bytes)
> }
>
>
>
>
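
To relate the ModRM bytes in the addr_nop_* functions above (0x40, 0x44, 0x80, 0x84) to the operand forms in the Intel list, a small decoding sketch may help. It is only an illustration under stated assumptions: the names `ModRM`, `SIB`, `decode_modrm`, `decode_sib` and `addr_nop_length` are hypothetical, it assumes 32-bit addressing, and it covers only the mod=01 (8-bit displacement) and mod=10 (32-bit displacement) forms used here.

```cpp
#include <cstdint>

// Fields of a ModRM byte: mod (bits 7-6), reg (bits 5-3), rm (bits 2-0).
struct ModRM { unsigned mod, reg, rm; };

// Fields of a SIB byte; `scale` holds the multiplier (1, 2, 4 or 8),
// not the raw 2-bit field.
struct SIB { unsigned scale, index, base; };

inline ModRM decode_modrm(std::uint8_t b) {
    ModRM m;
    m.mod = (b >> 6) & 3u;
    m.reg = (b >> 3) & 7u;
    m.rm  = b & 7u;
    return m;
}

inline SIB decode_sib(std::uint8_t b) {
    SIB s;
    s.scale = 1u << ((b >> 6) & 3u);  // raw field 0 encodes a multiplier of 1
    s.index = (b >> 3) & 7u;
    s.base  = b & 7u;
    return s;
}

// Total length of "0x0F 0x1F modrm ..." for the forms used above.
// Returns -1 for forms outside this sketch (mod=00 and mod=11).
inline int addr_nop_length(std::uint8_t modrm) {
    ModRM m = decode_modrm(modrm);
    int disp;
    if (m.mod == 1)      disp = 1;    // 8-bit displacement
    else if (m.mod == 2) disp = 4;    // 32-bit displacement
    else                 return -1;
    int sib = (m.rm == 4) ? 1 : 0;    // rm=100 means a SIB byte follows
    return 2 /* opcode */ + 1 /* ModRM */ + sib + disp;
}
```

With these helpers, 0x40 and 0x80 decode to a plain [EAX] base with an 8-bit and a 32-bit displacement, while 0x44 and 0x84 have rm=100 and so pull in the SIB byte 0x00. That SIB byte decodes to base EAX, index EAX and a multiplier of 1, so the form the source comments write as "EAX*0" corresponds to the "EAX*1" in the Intel list.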