Question: "backwards" addressing

Sat Sep 5 02:20:22 PDT 2009

Ulf Zibis wrote:
> Do you know any reason for this ?

Let's see what code is generated (on 32-bit x86)...

>     static void loop1(int off, char in1, char in2, char[] out) {
>         out[off+3] = in1;
>         out[off+5] = in2;
>         out[off+0] = in1;
>         out[off+4] = in2;
>         out[off+9] = in1;
>         out[off+8] = in2;
>         out[off+6] = in1;
>         out[off+1] = in2;
>         out[off+7] = in1;
>         out[off+2] = in2;
>     }

030   	MOV16  [EDI + #22 + ECX << #1],EBP
035   	MOV16  [EDI + #12 + ECX << #1],EDX
03a   	MOV16  [EDI + #20 + ECX << #1],EBP
03f   	MOV16  [EDI + #30 + ECX << #1],EDX
044   	MOV16  [EDI + #28 + ECX << #1],EBP
049   	MOV16  [EDI + #24 + ECX << #1],EDX
04e   	MOV16  [EDI + #14 + ECX << #1],EBP
053   	MOV16  [EDI + #26 + ECX << #1],EDX
058   	MOV16  [EDI + #16 + ECX << #1],EBP

> 
>     static void loop2(int off, char in1, char in2, char[] out) {
>         out[off++] = in1;
>         out[off++] = in2;
>         out[off++] = in1;
>         out[off++] = in2;
>         out[off++] = in1;
>         out[off++] = in2;
>         out[off++] = in1;
>         out[off++] = in2;
>         out[off++] = in1;
>         out[off++] = in2;
>     }

02e   	MOV16  [EBX + #14 + ECX << #1],EDI
033   	MOV16  [EBX + #16 + ECX << #1],EDX
038   	MOV16  [EBX + #18 + ECX << #1],EDI
03d   	MOV16  [EBX + #20 + ECX << #1],EDX
042   	MOV16  [EBX + #22 + ECX << #1],EDI
047   	MOV16  [EBX + #24 + ECX << #1],EDX
04c   	MOV16  [EBX + #26 + ECX << #1],EDI
051   	MOV16  [EBX + #28 + ECX << #1],EDX
056   	MOV16  [EBX + #30 + ECX << #1],EDI

As you can see, the instruction sequences are almost the same, except
for the ordering.  In theory the second one should be faster, since
sequential writes should perform better.

Additionally in the final compiled version of main, where both loop1 and
loop2 are inlined, the loops are unrolled a couple of times (4x).

> Hm, that's what I thought at first, so in forward case cpu should do:
>    1. base+index -> index register
>    2. move highSurrogate(accumulator), [index register]
>    3. increment index register
>    4. move lowSurrogate(accumulator), [index register]
> 
> Backward case cpu should do:
>    1. base+index -> index register
>    2. move -> temp
>    3. increment index register
>    4. move lowSurrogate(accumulator), [index register]
>    5. load temp -> index register
>    6. move highSurrogate(accumulator), [index register]
> 
> Please excuse my insisting, but I really don't understand why
> both should run in the same time.
> Can you explain once more?

From the assembly above you can see that x86 instructions have complex
addressing modes (base register + constant displacement + scaled index
register in a single store), and that's why there is no need for an
index increment between the writes.
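
To make the address arithmetic concrete, here is a minimal sketch in
Java of what each store's operand encodes.  The names are hypothetical,
and the 12-byte array header is an assumption read off the
displacements in the listings above (it is not taken from the VM
sources).  The point is only that the constant part of "off+k" folds
into the displacement at compile time, so the order of the stores costs
no extra instructions.

```java
public class AddressingSketch {
    // Assumption: 12 bytes of array header before element 0, inferred
    // from the displacements (#12, #14, ...) in the listings above.
    static final int ARRAY_BASE_OFFSET = 12;
    // A char is 2 bytes, so indices are scaled by a left shift of 1.
    static final int CHAR_SHIFT = 1;

    // Constant displacement the compiler can fold in for out[off + k].
    static int displacement(int k) {
        return ARRAY_BASE_OFFSET + (k << CHAR_SHIFT);
    }

    // Effective address of out[off + k], mirroring the operand form
    // [base + #displacement + off << #1] in a single instruction.
    static long effectiveAddress(long base, int off, int k) {
        return base + displacement(k) + ((long) off << CHAR_SHIFT);
    }

    public static void main(String[] args) {
        for (int k = 0; k < 10; k++) {
            System.out.println("out[off+" + k + "] -> displacement #"
                    + displacement(k));
        }
    }
}
```

With these assumptions out[off+5] gets displacement #22 and out[off+9]
gets #30, which matches the operands in the listings; "off" itself
stays untouched in its register (ECX) for every store.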

-- Christian