Vectorized Loop Unrolling on x64?

Tue Oct 24 17:03:51 UTC 2017

You are right: your initial examples are not supported by the current HotSpot JIT vectorization.
The second example (sum/reduction) could be optimized (https://bugs.openjdk.java.net/browse/JDK-8074981), but because
the generated code is very expensive we limited it to cases where the benefit outweighs the cost:
https://bugs.openjdk.java.net/browse/JDK-8078563.
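
For reference, the loop shapes discussed in this thread can be written out as below. This is only an illustrative
sketch: the class and method names (LoopShapes, addArrays, addConstant, sumToLong) are made up here and do not come
from the thread. The comments restate what was reported in this thread for JDK 9.0.1 on x64; whether a given loop is
actually vectorized depends on the JDK build and the CPU.

```java
// Illustrative only: class and method names are hypothetical, not from the thread.
class LoopShapes {

    // Element-wise add: a[i] = b[i] + c[i].
    // Reported in this thread as vectorized.
    // Assumes b and c are at least as long as a.
    static void addArrays(int[] a, int[] b, int[] c) {
        for (int i = 0; i < a.length; i++) {
            a[i] = b[i] + c[i];
        }
    }

    // Add a loop-invariant int to every element: a[i] = a[i] + value.
    // Reported in this thread as vectorized.
    static void addConstant(int[] a, int value) {
        for (int i = 0; i < a.length; i++) {
            a[i] = a[i] + value;
        }
    }

    // Sum reduction that widens each int element to long.
    // This is the shape of hotMethod() in the benchmark; reported in this
    // thread as unrolled but not vectorized.
    static long sumToLong(int[] array) {
        long sum = 0;
        for (int i = 0; i < array.length; i++) {
            sum += array[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] b = new int[1000];
        for (int i = 0; i < b.length; i++) {
            b[i] = i + 1;                     // 1, 2, ..., 1000
        }
        int[] c = b.clone();
        int[] a = new int[b.length];

        addArrays(a, b, c);                   // a[i] = 2 * (i + 1)
        addConstant(a, 1);                    // a[i] = 2 * (i + 1) + 1

        System.out.println(sumToLong(b));     // prints 500500
        System.out.println(sumToLong(a));     // prints 1002000
    }
}
```

Running main only checks that the loops compute the expected values; it says nothing about vectorization. To see what
C2 emits for each loop, the generated code has to be inspected, as in the hot-region listing quoted in this thread.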

Regards,
Vladimir

On 10/24/17 9:46 AM, Ionut wrote:
> Hello All,
> 
>     Meanwhile, I tested two more scenarios:
> 
> - a[i] = b[i] + c[i]           // where a, b, c are arrays of ints
> - a[i] = a[i] + <int_value>    // where <int_value> might be a constant, etc.
> 
> Both were vectorized, but my initial example (iterating over an array of ints and computing the sum of its
> elements) was not, which makes me think this case is currently not supported by the JIT.
> 
> Could you please confirm this?
> 
> Regards
> Ionut
> 
> 
> On Tuesday, October 24, 2017 12:24 PM, Ionut <ionutb83 at yahoo.com> wrote:
> 
> 
> Hi Nils,
> 
>    Thanks, that is clear. However, I tried a simple example with JMH (iterating over an array and summing its
> elements) on x64 Linux, and it does not seem to be vectorized. The source code and assembly are below.
> Could you please give me a hint? Am I doing something wrong?
> 
> *JDK is 9.0.1*
> 
> *_Source code:_*
> 
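> package com.jpt;
> 
> import java.util.concurrent.TimeUnit;
> 
> import org.openjdk.jmh.annotations.*;
> import org.openjdk.jmh.runner.Runner;
> import org.openjdk.jmh.runner.RunnerException;
> import org.openjdk.jmh.runner.options.Options;
> import org.openjdk.jmh.runner.options.OptionsBuilder;
> 
> @BenchmarkMode(Mode.AverageTime)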
> @OutputTimeUnit(TimeUnit.NANOSECONDS)
> @Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.NANOSECONDS)
> @Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.NANOSECONDS)
> @Fork(value = 3, jvmArgsAppend = { "-XX:-TieredCompilation", "-Xbatch", "-XX:+UseSuperWord" })
> @State(Scope.Benchmark)
> public class Sum1ToNArray {
>      private int[] array;
> 
>      public static void main(String[] args) throws RunnerException {
>          Options opt =
>              new OptionsBuilder()
>                  .include(Sum1ToNArray.class.getSimpleName())
>                  .build();
>          new Runner(opt).run();
>      }
> 
>      @Setup(Level.Trial)
>      public void setUp() {
>          this.array = new int[100_000_000];
>          for (int i = 0; i < array.length; i++)
>              array[i] = i + 1;
>      }
> 
>      @Benchmark
>      public long hotMethod() {
>          long sum = 0;
>          for (int i = 0; i < array.length; i++) {
>              sum += array[i];
>          }
>          return sum;
>      }
> }
> 
> *_Assembly:_*
> ....[Hottest Region 1]..............................................................................
> c2, com.jpt.Sum1ToNArray::hotMethod, version 139 (63 bytes)
> 
>                        0x00007f7bf1bff0f9: mov    r8d,r10d
>                        0x00007f7bf1bff0fc: add    r8d,0xfffffff9
>                        0x00007f7bf1bff100: mov    r11d,0x1
>                        0x00007f7bf1bff106: cmp    r8d,0x1
>                   ╭    0x00007f7bf1bff10a: jg     0x00007f7bf1bff114
>                   │    0x00007f7bf1bff10c: mov    rax,rdx
>                   │╭   0x00007f7bf1bff10f: jmp    0x00007f7bf1bff15d
>                   ││↗  0x00007f7bf1bff111: mov    rdx,rax             ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
>                   │││                                                 ; - com.jpt.Sum1ToNArray::hotMethod@13 (line 53)
>                   ↘││  0x00007f7bf1bff114: movsxd rsi,DWORD PTR [r14+r11*4+0x10]
>  11.08%    8.55%   ││  0x00007f7bf1bff119: movsxd rbp,DWORD PTR [r14+r11*4+0x14]
>   0.30%    0.17%   ││  0x00007f7bf1bff11e: movsxd r13,DWORD PTR [r14+r11*4+0x18]
>                    ││  0x00007f7bf1bff123: movsxd rax,DWORD PTR [r14+r11*4+0x2c]
>   8.86%    2.85%   ││  0x00007f7bf1bff128: movsxd r9,DWORD PTR [r14+r11*4+0x28]
>  10.49%   23.29%   ││  0x00007f7bf1bff12d: movsxd rcx,DWORD PTR [r14+r11*4+0x24]
>   0.38%    0.45%   ││  0x00007f7bf1bff132: movsxd rbx,DWORD PTR [r14+r11*4+0x20]
>   0.03%    0.06%   ││  0x00007f7bf1bff137: movsxd rdi,DWORD PTR [r14+r11*4+0x1c]
>   0.23%    0.22%   ││  0x00007f7bf1bff13c: add    rsi,rdx
>  10.58%   18.59%   ││  0x00007f7bf1bff13f: add    rbp,rsi
>   0.32%    0.17%   ││  0x00007f7bf1bff142: add    r13,rbp
>   0.05%    0.04%   ││  0x00007f7bf1bff145: add    rdi,r13
>  26.10%   28.47%   ││  0x00007f7bf1bff148: add    rbx,rdi
>   5.55%    5.48%   ││  0x00007f7bf1bff14b: add    rcx,rbx
>   5.66%    1.32%   ││  0x00007f7bf1bff14e: add    r9,rcx
>   7.85%    3.11%   ││  0x00007f7bf1bff151: add    rax,r9              ;*ladd {reexecute=0 rethrow=0 return_oop=0}
>                    ││                                                 ; - com.jpt.Sum1ToNArray::hotMethod@21 (line 53)
>  10.19%    5.67%   ││  0x00007f7bf1bff154: add    r11d,0x8            ;*iinc {reexecute=0 rethrow=0 return_oop=0}
>                    ││                                                 ; - com.jpt.Sum1ToNArray::hotMethod@23 (line 52)
>   0.38%    0.12%   ││  0x00007f7bf1bff158: cmp    r11d,r8d
>                    │╰  0x00007f7bf1bff15b: jl     0x00007f7bf1bff111  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
>                    │                                                  ; - com.jpt.Sum1ToNArray::hotMethod@10 (line 52)
>                    ↘   0x00007f7bf1bff15d: cmp    r11d,r10d
>                        0x00007f7bf1bff160: jge    0x00007f7bf1bff174
>                        0x00007f7bf1bff162: xchg   ax,ax               ;*lload_1 {reexecute=0 rethrow=0 return_oop=0}
>                                                                       ; - com.jpt.Sum1ToNArray::hotMethod@13 (line 53)
>                        0x00007f7bf1bff164: movsxd r8,DWORD PTR [r14+r11*4+0x10]
>                        0x00007f7bf1bff169: add    rax,r8              ;*ladd {reexecute=0 rethrow=0 return_oop=0}
>                                                                       ; - com.jpt.Sum1ToNArray::hotMethod@21 (line 53)
> 
> Regards
> 
> 
> On Tuesday, October 24, 2017 11:22 AM, Nils Eliasson <nils.eliasson at oracle.com> wrote:
> 
> 
> Hi Ionut,
> In this case x86 refers to both x86_32/ia32 and x86_64/amd64/x64.
> Regards,
> Nils Eliasson
> 
> On 2017-10-24 11:05, Ionut wrote:
>> Hello All,
>>
>>     I want to ask about https://bugs.openjdk.java.net/browse/JDK-8129920 (Vectorized loop unrolling), which says
>> it is applicable only to x86 targets. Do you plan to port this to x64 as well, or am I missing something?
>>
>> Regards
>> Ionut