Question: "backwards" adressing

Paul Hohensee Paul.Hohensee at Sun.COM
Tue Sep 8 08:06:35 PDT 2009


The unit of memory access on current x86 designs is larger than a single
2-byte word, usually at least 8 bytes.  The processors have store-combining
buffers that merge stores to the same cache line if they happen close enough
together in time, so a small number of store instructions such as those in
your example will all merge into at most 3 store-buffer entries, regardless
of instruction ordering.
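
One rough way to check that on a particular machine is a timing sketch like
the one below.  It is not a rigorous benchmark; the loop bodies are copies
of the ones from the quoted mail, and the array size and iteration counts
are arbitrary.

    public class StoreOrderTest {

        // Copied from the quoted mail below: scattered store order.
        static void loop1(int off, char in1, char in2, char[] out) {
            out[off+3] = in1; out[off+5] = in2; out[off+0] = in1; out[off+4] = in2;
            out[off+9] = in1; out[off+8] = in2; out[off+6] = in1; out[off+1] = in2;
            out[off+7] = in1; out[off+2] = in2;
        }

        // Copied from the quoted mail below: sequential store order.
        static void loop2(int off, char in1, char in2, char[] out) {
            out[off++] = in1; out[off++] = in2; out[off++] = in1; out[off++] = in2;
            out[off++] = in1; out[off++] = in2; out[off++] = in1; out[off++] = in2;
            out[off++] = in1; out[off++] = in2;
        }

        static long time(boolean scattered, char[] out, int iters) {
            long start = System.nanoTime();
            for (int i = 0; i < iters; i++) {
                int off = (i * 10) % (out.length - 10);   // keep off+9 in bounds
                if (scattered) loop1(off, 'a', 'b', out);
                else           loop2(off, 'a', 'b', out);
            }
            return System.nanoTime() - start;
        }

        public static void main(String[] args) {
            char[] out = new char[1 << 16];
            for (int warm = 0; warm < 5; warm++) {        // let the JIT compile both paths
                time(true, out, 1000000);
                time(false, out, 1000000);
            }
            System.out.println("loop1 (scattered):  " + time(true, out, 10000000) + " ns");
            System.out.println("loop2 (sequential): " + time(false, out, 10000000) + " ns");
        }
    }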

Paul

Ulf Zibis wrote:
> Christian, thanks for remembering my question.
>
>
> On 05.09.2009 11:20, Christian Thalinger wrote:
>> Ulf Zibis wrote:
>>  
>>> Do you know any reason for this ?
>>>     
>>
>> Let's see what code is generated (on 32-bit x86)...
>>
>>  
>>>     static void loop1(int off, char in1, char in2, char[] out) {
>>>         out[off+3] = in1;
>>>         out[off+5] = in2;
>>>         out[off+0] = in1;
>>>         out[off+4] = in2;
>>>         out[off+9] = in1;
>>>         out[off+8] = in2;
>>>         out[off+6] = in1;
>>>         out[off+1] = in2;
>>>         out[off+7] = in1;
>>>         out[off+2] = in2;
>>>     }
>>>     
>>
>> 030       MOV16  [EDI + #22 + ECX << #1],EBP
>> 035       MOV16  [EDI + #12 + ECX << #1],EDX
>> 03a       MOV16  [EDI + #20 + ECX << #1],EBP
>> 03f       MOV16  [EDI + #30 + ECX << #1],EDX
>> 044       MOV16  [EDI + #28 + ECX << #1],EBP
>> 049       MOV16  [EDI + #24 + ECX << #1],EDX
>> 04e       MOV16  [EDI + #14 + ECX << #1],EBP
>> 053       MOV16  [EDI + #26 + ECX << #1],EDX
>> 058       MOV16  [EDI + #16 + ECX << #1],EBP
>>
>>  
>>>     static void loop2(int off, char in1, char in2, char[] out) {
>>>         out[off++] = in1;
>>>         out[off++] = in2;
>>>         out[off++] = in1;
>>>         out[off++] = in2;
>>>         out[off++] = in1;
>>>         out[off++] = in2;
>>>         out[off++] = in1;
>>>         out[off++] = in2;
>>>         out[off++] = in1;
>>>         out[off++] = in2;
>>>     }
>>>     
>>
>> 02e       MOV16  [EBX + #14 + ECX << #1],EDI
>> 033       MOV16  [EBX + #16 + ECX << #1],EDX
>> 038       MOV16  [EBX + #18 + ECX << #1],EDI
>> 03d       MOV16  [EBX + #20 + ECX << #1],EDX
>> 042       MOV16  [EBX + #22 + ECX << #1],EDI
>> 047       MOV16  [EBX + #24 + ECX << #1],EDX
>> 04c       MOV16  [EBX + #26 + ECX << #1],EDI
>> 051       MOV16  [EBX + #28 + ECX << #1],EDX
>> 056       MOV16  [EBX + #30 + ECX << #1],EDI
>>
>> As you can see, the instruction sequence is almost the same, except for
>> the ordering.  Theoretically the second one should be faster, as
>> sequential writes should perform better.
>>   
>
> Yes, that's what I'm wondering about too, because in my test it's even 
> the other way around.
> Additionally I don't understand:
> - the additional shift by #1: the address in loop2 would then be 
> incremented by 4 (or is there a parenthesis missing around (ECX << #1)?), 
> but we are in a char[], not an int[]
> - Why doesn't HotSpot compile to INC opcodes? I think CPUs wouldn't have 
> INC if it didn't have advantages. ???
> - Each of these MOVs with the complex addressing mode needs 5 bytes to be 
> loaded, I guess; INC should be shorter.
> - Doesn't x86 have a combined MOV_&_INC opcode?
>
> Stupid questions? OK, it's a long time since I programmed in assembler, 
> and modern x86 didn't exist at that time.
>
> -Ulf
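
On the addressing questions in the quoted mail: an operand like
MOV16 [EDI + #20 + ECX << #1], EBP is scaled-index addressing and reads as a
store of the low 16 bits of EBP to address EDI + 20 + ECX*2, with off in ECX.
The shift applies only to ECX, and the scale is 2 because a char element is
2 bytes wide (it would be 4 for an int[]).  The constant part of each index,
plus the 12-byte char[] element base offset visible as the #12 generated for
out[off+0], is folded into the displacement at compile time, which is also
why loop2 comes out as stores with growing displacements rather than INCs.
(x86 has no general store-with-post-increment; the closest is the string
instruction STOSW, which is tied to EDI/AX and not usable here.)  A small
sketch, assuming that reading, which reproduces the displacements in the
listings above:

    public class Displacements {
        // Assumed char[] element base offset on the 32-bit VM, matching the
        // "#12" displacement generated for out[off+0].
        static final int BASE = 12;

        static void print(String name, int[] indices) {
            System.out.print(name + ":");
            for (int k : indices) {
                // address = base + d + off*2, with d = BASE + 2*k
                System.out.print(" #" + (BASE + 2 * k));
            }
            System.out.println();
        }

        public static void main(String[] args) {
            print("loop1", new int[] {3, 5, 0, 4, 9, 8, 6, 1, 7, 2});
            print("loop2", new int[] {0, 1, 2, 3, 4, 5, 6, 7, 8, 9});
            // Prints  loop1: #18 #22 #12 #20 #30 #28 #24 #14 #26 #16
            //         loop2: #12 #14 #16 #18 #20 #22 #24 #26 #28 #30
            // Each quoted listing shows nine of these ten stores.
        }
    }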


More information about the hotspot-compiler-dev mailing list