Field access optimisations inside loops (question).

Thu Jan 7 23:52:02 PST 2010

Hi everyone,

I have been wondering about one thing and failed to find the answer so far.

I observed the following behavior from the HotSpot compiler. Consider
the following two loops:

1.
       for (int i = 0; i < list.size(); i++)
       {
           value = list.get(i);
       }

2.
       final int size = list.size();
       final int [] buffer = list.buffer;
       for (int i = 0; i < size; i++)
       {
           value = buffer[i];
       }

The list variable's class is simple and the implementation of size()
and get() is trivial, basically:

get == return buffer[index];
size == return elementsCount;

(no boundary checks, inheritance or anything like this).

What's interesting is that loop (2) is at least TWICE as fast as loop
(1) (on various CPU architectures, multi and single-core systems). The "value"
variable is a public static volatile to force the compiler to actually
read the buffer's content and store is somewhere. Example timings:

testSimpleGetLoop          : time.bench: 1.97, round: 0.20 [+- 0.00],
round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00
testDirectBufferLoop       : time.bench: 0.57, round: 0.06 [+- 0.01],
round.gc: 0.00 [+- 0.00], GC.calls: 0, GC.time: 0.00

(note the "round" field above, in brackets are standard deviations
from multiple test runs).

I looked at the assembly dumped by the HotSpot and from it I can
conclude that in case of loop (1) the generated code always attempts
to re-read the fields referenced through the list field (it is
object-scoped, private, final,
non-volatile, direct class access -- no interfaces). See listing (A)
below. My understanding of the JMM was that
in cases such as this one, the compiler can safely assume no side
effects for the current thread and move field references to registers.
This is exactly what happens in case (2) (see listing (B) below) --
the references to
local variables are moved to regiters, with additional loop unrolling
performed by the compiler.

My question is if there is anything that prevents the compiler from
caching the "list.buffer" pointer in
version (1) of the loop? Note that get and size methods are inlined
properly, but the field access is still
performed on every loop iteration.

I'd appreciate a hotspot code pointer where this situation is
considered or a JVM reference pointer
which makes the compiler behave as it does. Apologies in advance if
I'm missing something
naively simple.

Dawid

(A) simple get loop, pseudo-assembly dump (after warmup).

030   B4: #     B11 B5 &lt;- B3 B8      Loop: B4-B8 inner  Freq: 220753
030     MOV    EDX,[EBP + #12] ! Field .buffer
033     NullCheck EBP
033
033   B5: #     B12 B6 &lt;- B4  Freq: 220753
033     MOV    EBP,[EDX + #8]
036     NullCheck EDX
036
036   B6: #     B10 B7 &lt;- B5  Freq: 220752
036     CMPu   EDI,EBP
038     Jnb,us B10  P=0.000001 C=-1.000000
038
03a   B7: #     B13 B8 &lt;- B6  Freq: 220752
03a     MOVSX8 EAX,[EDX + #12 + EDI]    # byte
03f     MEMBAR-release ! (empty encoding)
03f     MOV8   [ECX + precise klass Benchmark:
0x048e80e8:Constant:exact *],EAX ! Field  Volatile value
045     MEMBAR-volatile (unnecessary so empty encoding)
045     LOCK ADDL [ESP + #0], 0 ! membar_volatile
04a     MOV    EBP,[EBX + #12] ! Field .list
04d     MOV    EDX,[EBP + #8]   # int ! Field .elementsCount
050     NullCheck EBP
050
050   B8: #     B4 B9 &lt;- B7  Freq: 220752
050     INC    EDI
051     TSTL   #polladdr,EAX    ! Safepoint: poll for GC
057     CMP    EDI,EDX
059     Jl,s  B4  P=1.000000 C=179200.000000

(B) direct buffer access loop (after warmup).

01a     MOV    ECX,[ECX + #12] ! Field .list
01d     MOV    EAX,[ECX + #8]   # int ! Field .elementsCount
020     NullCheck ECX
020
020   B2: #     B12 B3 &lt;- B1  Freq: 0.999999
020     TEST   EAX,EAX
022     Jle    B12  P=0.000000 C=1.000000
022
028   B3: #     B4 &lt;- B2  Freq: 0.999999
028     MOV    EDI,[ECX + #12] ! Field .buffer
02b     XOR    EBX,EBX
02d     MOV    EBP,#360
02d
032   B4: #     B14 B5 &lt;- B3 B6      Loop: B4-B6 inner stride: not constant
pre of N158 Freq: 1.99999
032     MOV    EDX,[EDI + #8]
035     NullCheck EDI
035
035   B5: #     B13 B6 &lt;- B4  Freq: 1.99999
035     CMPu   EBX,EDX
037     Jnb,u  B13  P=0.000001 C=-1.000000
037
03d   B6: #     B4 B7 &lt;- B5  Freq: 1.99999
03d     MOVSX8 ECX,[EDI + #12 + EBX]    # byte
042     MEMBAR-release ! (empty encoding)
042     MOV8   [EBP + precise klass : 0x04161d30:Constant:exact *],ECX
! Field  Volatile.value
048     MEMBAR-volatile (unnecessary so empty encoding)
048     LOCK ADDL [ESP + #0], 0 ! membar_volatile
04d     INC    EBX
04e     CMP    EBX,#1
051     Jl,s  B4        # Loop end  P=0.500000 C=334848.000000

053   B7: #     B9 B8 &lt;- B6  Freq: 0.999994
053     MOV    ESI,EAX
055     MIN    ESI,EDX
05b     SUB    ESI,EBX
05d     AND    ESI,#-4
060     ADD    ESI,EBX
062     CMP    EBX,ESI
064     Jge,s  B9  P=0.000001 C=-1.000000
       NOP     # 10 bytes pad for loops and calls

070   B8: #     B8 B9 &lt;- B7 B8       Loop: B8-B8 inner stride: not constant
main of N116 Freq: 999993
070     MOVSX8 ECX,[EDI + #12 + EBX]    # byte
075     MEMBAR-release ! (empty encoding)
075     MOV8   [EBP + precise klass : 0x04161d30:Constant:exact *],ECX
! Field  Volatile.value
07b     MEMBAR-volatile (unnecessary so empty encoding)
07b     MEMBAR-volatile (unnecessary so empty encoding)
07b     MOVSX8 ECX,[EDI + #13 + EBX]    # byte
080     MEMBAR-release ! (empty encoding)
080     MOV8   [EBP + precise klass : 0x04161d30:Constant:exact *],ECX
! Field  Volatile.value
086     MEMBAR-volatile (unnecessary so empty encoding)
086     MEMBAR-volatile (unnecessary so empty encoding)

... (loop further unrolled here) ...