RFR: 8342095: Add autovectorizer support for subword vector casts [v3]

Emanuel Peter epeter at openjdk.org
Fri May 2 08:48:49 UTC 2025


On Fri, 2 May 2025 05:19:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:

>> @jaskarth Let me know if there is anything we can help you with here :)
>
> @eme64 Thank you for the comments! I've updated the test and benchmark to be more exhaustive, and applied the suggested changes. For the benchmark, I got these results on my machine:
> 
>                                                   Baseline                    Patch
> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units    Score   Error  Units  Improvement
> VectorSubword.byteToChar     1024  avgt   12  252.954 ±  4.129  ns/op   24.219 ± 0.453  ns/op  (10.4x)
> VectorSubword.byteToInt      1024  avgt   12  194.707 ±  3.584  ns/op   38.353 ± 0.637  ns/op  (5.07x)
> VectorSubword.byteToLong     1024  avgt   12   73.645 ±  1.418  ns/op   70.521 ± 0.470  ns/op  (no change)
> VectorSubword.byteToShort    1024  avgt   12  252.647 ±  3.738  ns/op   22.664 ± 0.449  ns/op  (11.1x)
> VectorSubword.charToByte     1024  avgt   12  236.396 ±  3.893  ns/op  228.710 ± 1.967  ns/op  (no change)
> VectorSubword.charToInt      1024  avgt   12  179.673 ±  2.811  ns/op  173.764 ± 1.150  ns/op  (no change)
> VectorSubword.charToLong     1024  avgt   12  184.867 ±  3.079  ns/op  177.999 ± 1.312  ns/op  (no change)
> VectorSubword.charToShort    1024  avgt   12   24.385 ±  1.822  ns/op   22.375 ± 1.980  ns/op  (no change)
> VectorSubword.intToByte      1024  avgt   12  190.949 ±  1.475  ns/op   49.376 ± 1.383  ns/op  (3.86x)
> VectorSubword.intToChar      1024  avgt   12  182.862 ±  3.708  ns/op   44.344 ± 4.513  ns/op  (4.12x)
> VectorSubword.intToLong      1024  avgt   12   76.072 ±  1.153  ns/op   73.382 ± 0.294  ns/op  (no change)
> VectorSubword.intToShort     1024  avgt   12  184.362 ±  1.938  ns/op   45.556 ± 3.323  ns/op  (4.04x)
> VectorSubword.longToByte     1024  avgt   12  150.766 ±  3.475  ns/op  146.651 ± 0.742  ns/op  (no change)
> VectorSubword.longToChar     1024  avgt   12  121.764 ±  1.323  ns/op  117.068 ± 1.891  ns/op  (no change)
> VectorSubword.longToInt      1024  avgt   12   83.761 ±  2.140  ns/op   82.084 ± 0.930  ns/op  (no change)
> VectorSubword.longToShort    1024  avgt   12  132.293 ± 23.046  ns/op  115.883 ± 0.834  ns/op  (+ 12.4%)
> VectorSubword.shortToByte    1024  avgt   12  253.387 ±  5.972  ns/op   27.591 ± 1.311  ns/op  (9.18x)
> VectorSubword.shortToChar    1024  avgt   12   21.446 ±  1.914  ns/op   20.608 ± 1.593  ns/op  (no change)
> VectorSubword.shortToInt     1024  avgt   12  187.109 ±  3.372  ns/op   36.818 ± 0.989  ns/op  (5.08x)
> VectorSubword.shortToLong    1024  avgt   12   75.448 ±  0.930  ns/op   72.835 ± 0.507  ns/op  (no change)
> 
> Interestingly, eve...

@jaskarth I ran it with `perf`:

`make test TEST="micro:VectorSubword.charToByte" CONF=linux-x64 TEST_VM_OPTS="-XX:-UseSuperWord" MICRO="OPTIONS=-prof perfasm"`
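
For readers following along without the benchmark source open: judging from the bytecode annotations in the listings below (`caload`, `bastore`, `iinc`, `goto`), the measured kernel is presumably a plain narrowing copy loop. A minimal sketch under that assumption; the class name, parameter names, and array setup here are hypothetical, not copied from `VectorSubword`:

```java
public class CharToByteSketch {
    // Assumed shape of the benchmark kernel (hypothetical reconstruction):
    // each iteration is one caload, a narrowing i2b, and one bastore.
    static void charToByte(char[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i]; // keeps only the low 8 bits of the char
        }
    }

    public static void main(String[] args) {
        char[] src = new char[1024]; // 1024 matches the SIZE column in the results table
        for (int i = 0; i < src.length; i++) {
            src[i] = (char) (i * 31);
        }
        byte[] dst = new byte[src.length];
        charToByte(src, dst);
        System.out.println("converted " + dst.length + " elements");
    }
}
```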

Without SuperWord, 90%+ of the time is spent in the main loop, which is unrolled 8x:

   1.13%  ↗│     0x00007fc0b01d79b0:   movslq %r11d,%r14                   ;*bastore {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
          ││                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   2.44%  ││     0x00007fc0b01d79b3:   movzwl 0x10(%rdi,%r14,2),%r10d      ;*caload {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
          ││                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   2.73%  ││     0x00007fc0b01d79b9:   mov    %r10b,0x10(%r9,%r14,1)       ;*bastore {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
          ││                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   5.89%  ││     0x00007fc0b01d79be:   movzwl 0x1e(%rdi,%r14,2),%r10d
   2.35%  ││     0x00007fc0b01d79c4:   movzwl 0x1c(%rdi,%r14,2),%esi
   1.51%  ││     0x00007fc0b01d79ca:   movzwl 0x1a(%rdi,%r14,2),%r8d
   5.34%  ││     0x00007fc0b01d79d0:   movzwl 0x18(%rdi,%r14,2),%edx
   1.16%  ││     0x00007fc0b01d79d6:   movzwl 0x16(%rdi,%r14,2),%ebx
   1.77%  ││     0x00007fc0b01d79dc:   movzwl 0x14(%rdi,%r14,2),%ebp
   2.51%  ││     0x00007fc0b01d79e2:   movzwl 0x12(%rdi,%r14,2),%eax       ;*caload {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
          ││                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   5.73%  ││     0x00007fc0b01d79e8:   mov    %al,0x11(%r9,%r14,1)
  13.70%  ││     0x00007fc0b01d79ed:   mov    %bpl,0x12(%r9,%r14,1)
   6.72%  ││     0x00007fc0b01d79f2:   mov    %bl,0x13(%r9,%r14,1)
   9.84%  ││     0x00007fc0b01d79f7:   mov    %dl,0x14(%r9,%r14,1)
   6.01%  ││     0x00007fc0b01d79fc:   mov    %r8b,0x15(%r9,%r14,1)
   6.05%  ││     0x00007fc0b01d7a01:   mov    %sil,0x16(%r9,%r14,1)
   5.24%  ││     0x00007fc0b01d7a06:   mov    %r10b,0x17(%r9,%r14,1)       ;*bastore {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
          ││                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
          ││                                                               ;   {other}
  11.16%  ││     0x00007fc0b01d7a0b:   add    $0x8,%r11d                   ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          ││                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 23 (line 75)
          ││                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   0.23%  ││     0x00007fc0b01d7a0f:   cmp    %ecx,%r11d
          ╰│     0x00007fc0b01d7a12:   jl     0x00007fc0b01d79b0           ;*goto {reexecute=0 rethrow=0 return_oop=0}
           │                                                               ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 26 (line 75)
           │                                                               ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
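
In source terms, the listing above corresponds roughly to the following: eight independent scalar load/store pairs per iteration (`movzwl` to load a char, a byte `mov` to store it), then `add $0x8` on the index. This is only a hand-written illustration of what C2's unrolling produces, assuming the kernel is a simple narrowing copy; the method name and the explicit cleanup loop are hypothetical stand-ins for the compiler's post loop:

```java
public class UnrolledBy8Sketch {
    // Hand-unrolled equivalent of the scalar main loop: 8 elements per
    // iteration, no vector instructions involved.
    static void charToByteUnrolled8(char[] src, byte[] dst) {
        int i = 0;
        int limit = src.length - 7; // last index where a full group of 8 still fits
        for (; i < limit; i += 8) {
            dst[i]     = (byte) src[i];
            dst[i + 1] = (byte) src[i + 1];
            dst[i + 2] = (byte) src[i + 2];
            dst[i + 3] = (byte) src[i + 3];
            dst[i + 4] = (byte) src[i + 4];
            dst[i + 5] = (byte) src[i + 5];
            dst[i + 6] = (byte) src[i + 6];
            dst[i + 7] = (byte) src[i + 7];
        }
        // Leftover elements when the length is not a multiple of 8.
        for (; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    public static void main(String[] args) {
        char[] src = new char[1024];
        for (int i = 0; i < src.length; i++) {
            src[i] = (char) (i * 7);
        }
        byte[] dst = new byte[src.length];
        charToByteUnrolled8(src, dst);
        System.out.println("converted " + dst.length + " elements");
    }
}
```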


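With SuperWord enabled, the main loop is unrolled 32x but stays scalar, and many of the loaded values are shuffled through XMM registers with `vmovd` before being stored: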

   2.91%  ↗  0x00007f5f801d6d71:   mov    %r11d,%r8d                   ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 10 (line 76)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   0.78%  │  0x00007f5f801d6d74:   vmovd  %r8d,%xmm4
   0.81%  │  0x00007f5f801d6d79:   movslq %r8d,%r14                    ;*bastore {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
          │  0x00007f5f801d6d7c:   movzwl 0x10(%r13,%r14,2),%r10d      ;*caload {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   9.42%  │  0x00007f5f801d6d82:   mov    %r10b,0x10(%rbp,%r14,1)      ;*bastore {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   4.40%  │  0x00007f5f801d6d87:   movzwl 0x4e(%r13,%r14,2),%r10d
   1.10%  │  0x00007f5f801d6d8d:   vmovd  %r10d,%xmm3
   0.16%  │  0x00007f5f801d6d92:   movzwl 0x4c(%r13,%r14,2),%r10d
   0.29%  │  0x00007f5f801d6d98:   vmovd  %r10d,%xmm6
   2.14%  │  0x00007f5f801d6d9d:   movzwl 0x4a(%r13,%r14,2),%r10d
          │  0x00007f5f801d6da3:   vmovd  %r10d,%xmm5
   2.17%  │  0x00007f5f801d6da8:   movzwl 0x48(%r13,%r14,2),%r10d
   0.03%  │  0x00007f5f801d6dae:   vmovd  %r10d,%xmm8
   1.91%  │  0x00007f5f801d6db3:   movzwl 0x46(%r13,%r14,2),%r10d
          │  0x00007f5f801d6db9:   vmovd  %r10d,%xmm7
   3.76%  │  0x00007f5f801d6dbe:   movzwl 0x44(%r13,%r14,2),%r10d      ;   {other}
   0.19%  │  0x00007f5f801d6dc4:   vmovd  %r10d,%xmm10
   1.07%  │  0x00007f5f801d6dc9:   movzwl 0x42(%r13,%r14,2),%r10d
          │  0x00007f5f801d6dcf:   vmovd  %r10d,%xmm9
   1.68%  │  0x00007f5f801d6dd4:   movzwl 0x40(%r13,%r14,2),%r10d
          │  0x00007f5f801d6dda:   vmovd  %r10d,%xmm12
   1.81%  │  0x00007f5f801d6ddf:   movzwl 0x3e(%r13,%r14,2),%r10d
          │  0x00007f5f801d6de5:   vmovd  %r10d,%xmm11
   2.17%  │  0x00007f5f801d6dea:   movzwl 0x3c(%r13,%r14,2),%r10d
   0.16%  │  0x00007f5f801d6df0:   vmovd  %r10d,%xmm14
   3.37%  │  0x00007f5f801d6df5:   movzwl 0x3a(%r13,%r14,2),%r10d
          │  0x00007f5f801d6dfb:   vmovd  %r10d,%xmm13
   1.98%  │  0x00007f5f801d6e00:   movzwl 0x38(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e06:   vmovd  %r10d,%xmm16
   1.59%  │  0x00007f5f801d6e0c:   movzwl 0x36(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e12:   vmovd  %r10d,%xmm15
   2.49%  │  0x00007f5f801d6e17:   movzwl 0x34(%r13,%r14,2),%r10d
   0.13%  │  0x00007f5f801d6e1d:   vmovd  %r10d,%xmm18
   1.94%  │  0x00007f5f801d6e23:   movzwl 0x32(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e29:   vmovd  %r10d,%xmm17
   2.43%  │  0x00007f5f801d6e2f:   movzwl 0x30(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e35:   vmovd  %r10d,%xmm20
   1.62%  │  0x00007f5f801d6e3b:   movzwl 0x2e(%r13,%r14,2),%r10d
   0.84%  │  0x00007f5f801d6e41:   vmovd  %r10d,%xmm19
   2.36%  │  0x00007f5f801d6e47:   movzwl 0x2c(%r13,%r14,2),%r10d
   0.06%  │  0x00007f5f801d6e4d:   vmovd  %r10d,%xmm22
   3.17%  │  0x00007f5f801d6e53:   movzwl 0x2a(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e59:   vmovd  %r10d,%xmm21
   2.04%  │  0x00007f5f801d6e5f:   movzwl 0x28(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e65:   vmovd  %r10d,%xmm24
   1.42%  │  0x00007f5f801d6e6b:   movzwl 0x26(%r13,%r14,2),%r10d
          │  0x00007f5f801d6e71:   vmovd  %r10d,%xmm23
   1.72%  │  0x00007f5f801d6e77:   movzwl 0x24(%r13,%r14,2),%esi
   0.03%  │  0x00007f5f801d6e7d:   movzwl 0x22(%r13,%r14,2),%r10d
   0.03%  │  0x00007f5f801d6e83:   movzwl 0x20(%r13,%r14,2),%r11d
          │  0x00007f5f801d6e89:   movzwl 0x1e(%r13,%r14,2),%r9d
          │  0x00007f5f801d6e8f:   movzwl 0x1c(%r13,%r14,2),%r8d
   0.03%  │  0x00007f5f801d6e95:   movzwl 0x1a(%r13,%r14,2),%ebx
   0.39%  │  0x00007f5f801d6e9b:   movzwl 0x18(%r13,%r14,2),%ecx
          │  0x00007f5f801d6ea1:   movzwl 0x16(%r13,%r14,2),%edx
   1.85%  │  0x00007f5f801d6ea7:   movzwl 0x14(%r13,%r14,2),%edi
          │  0x00007f5f801d6ead:   movzwl 0x12(%r13,%r14,2),%eax       ;*caload {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   0.06%  │  0x00007f5f801d6eb3:   mov    %al,0x11(%rbp,%r14,1)
   0.03%  │  0x00007f5f801d6eb8:   mov    %dil,0x12(%rbp,%r14,1)
          │  0x00007f5f801d6ebd:   mov    %dl,0x13(%rbp,%r14,1)
          │  0x00007f5f801d6ec2:   mov    %cl,0x14(%rbp,%r14,1)        ;   {other}
   0.49%  │  0x00007f5f801d6ec7:   mov    %bl,0x15(%rbp,%r14,1)
          │  0x00007f5f801d6ecc:   mov    %r8b,0x16(%rbp,%r14,1)
   1.78%  │  0x00007f5f801d6ed1:   mov    %r9b,0x17(%rbp,%r14,1)
   0.03%  │  0x00007f5f801d6ed6:   mov    %r11b,0x18(%rbp,%r14,1)
   0.03%  │  0x00007f5f801d6edb:   mov    %r10b,0x19(%rbp,%r14,1)
          │  0x00007f5f801d6ee0:   mov    %sil,0x1a(%rbp,%r14,1)
   1.88%  │  0x00007f5f801d6ee5:   vmovd  %xmm23,%r10d
          │  0x00007f5f801d6eeb:   mov    %r10b,0x1b(%rbp,%r14,1)
   2.30%  │  0x00007f5f801d6ef0:   vmovd  %xmm24,%r10d
   0.03%  │  0x00007f5f801d6ef6:   mov    %r10b,0x1c(%rbp,%r14,1)
   0.13%  │  0x00007f5f801d6efb:   vmovd  %xmm21,%r10d
          │  0x00007f5f801d6f01:   mov    %r10b,0x1d(%rbp,%r14,1)
   0.03%  │  0x00007f5f801d6f06:   vmovd  %xmm22,%r10d
          │  0x00007f5f801d6f0c:   mov    %r10b,0x1e(%rbp,%r14,1)
          │  0x00007f5f801d6f11:   vmovd  %xmm19,%r10d
   0.03%  │  0x00007f5f801d6f17:   mov    %r10b,0x1f(%rbp,%r14,1)
   1.85%  │  0x00007f5f801d6f1c:   vmovd  %xmm20,%r10d
          │  0x00007f5f801d6f22:   mov    %r10b,0x20(%rbp,%r14,1)
   0.16%  │  0x00007f5f801d6f27:   vmovd  %xmm17,%r10d
          │  0x00007f5f801d6f2d:   mov    %r10b,0x21(%rbp,%r14,1)
   0.03%  │  0x00007f5f801d6f32:   vmovd  %xmm18,%r10d
          │  0x00007f5f801d6f38:   mov    %r10b,0x22(%rbp,%r14,1)
          │  0x00007f5f801d6f3d:   vmovd  %xmm15,%r10d
          │  0x00007f5f801d6f42:   mov    %r10b,0x23(%rbp,%r14,1)
   1.98%  │  0x00007f5f801d6f47:   vmovd  %xmm16,%r10d
          │  0x00007f5f801d6f4d:   mov    %r10b,0x24(%rbp,%r14,1)
   0.13%  │  0x00007f5f801d6f52:   vmovd  %xmm13,%r10d
          │  0x00007f5f801d6f57:   mov    %r10b,0x25(%rbp,%r14,1)
   0.10%  │  0x00007f5f801d6f5c:   vmovd  %xmm14,%r10d
          │  0x00007f5f801d6f61:   mov    %r10b,0x26(%rbp,%r14,1)
          │  0x00007f5f801d6f66:   vmovd  %xmm11,%r10d
          │  0x00007f5f801d6f6b:   mov    %r10b,0x27(%rbp,%r14,1)
   2.04%  │  0x00007f5f801d6f70:   vmovd  %xmm12,%r10d
          │  0x00007f5f801d6f75:   mov    %r10b,0x28(%rbp,%r14,1)
   0.13%  │  0x00007f5f801d6f7a:   vmovd  %xmm9,%r10d
          │  0x00007f5f801d6f7f:   mov    %r10b,0x29(%rbp,%r14,1)
   1.46%  │  0x00007f5f801d6f84:   vmovd  %xmm10,%r10d
          │  0x00007f5f801d6f89:   mov    %r10b,0x2a(%rbp,%r14,1)
   0.26%  │  0x00007f5f801d6f8e:   vmovd  %xmm7,%r10d
   0.03%  │  0x00007f5f801d6f93:   mov    %r10b,0x2b(%rbp,%r14,1)
   2.20%  │  0x00007f5f801d6f98:   vmovd  %xmm8,%r10d
   0.19%  │  0x00007f5f801d6f9d:   mov    %r10b,0x2c(%rbp,%r14,1)
   2.30%  │  0x00007f5f801d6fa2:   vmovd  %xmm5,%r10d
   0.03%  │  0x00007f5f801d6fa7:   mov    %r10b,0x2d(%rbp,%r14,1)
   1.23%  │  0x00007f5f801d6fac:   vmovd  %xmm6,%r10d
   0.06%  │  0x00007f5f801d6fb1:   mov    %r10b,0x2e(%rbp,%r14,1)
   2.14%  │  0x00007f5f801d6fb6:   vmovd  %xmm3,%r10d
          │  0x00007f5f801d6fbb:   mov    %r10b,0x2f(%rbp,%r14,1)      ;*bastore {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   2.69%  │  0x00007f5f801d6fc0:   vmovd  %xmm4,%r11d                  ;   {other}
          │  0x00007f5f801d6fc5:   add    $0x20,%r11d                  ;*iinc {reexecute=0 rethrow=0 return_oop=0}
          │                                                            ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 23 (line 75)
          │                                                            ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
   0.32%  │  0x00007f5f801d6fc9:   cmp    0x34(%rsp),%r11d
          ╰  0x00007f5f801d6fce:   jl     0x00007f5f801d6d71           ;*goto {reexecute=0 rethrow=0 return_oop=0}
                                                                       ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 26 (line 75)
                                                                       ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)

What is happening here?

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846688923

