RFR: 8342095: Add autovectorizer support for subword vector casts [v3]
Emanuel Peter
epeter at openjdk.org
Fri May 2 08:48:49 UTC 2025
On Fri, 2 May 2025 05:19:41 GMT, Jasmine Karthikeyan <jkarthikeyan at openjdk.org> wrote:
>> @jaskarth Let me know if there is anything we can help you with here :)
>
> @eme64 Thank you for the comments! I've updated the test and benchmark to be more exhaustive, and applied the suggested changes. For the benchmark, I got these results on my machine:
>
>                                               Baseline                 Patch
> Benchmark                  (SIZE)  Mode  Cnt    Score    Error  Units    Score    Error  Units  Improvement
> VectorSubword.byteToChar     1024  avgt   12  252.954 ±  4.129  ns/op   24.219 ±  0.453  ns/op  (10.4x)
> VectorSubword.byteToInt      1024  avgt   12  194.707 ±  3.584  ns/op   38.353 ±  0.637  ns/op  (5.07x)
> VectorSubword.byteToLong     1024  avgt   12   73.645 ±  1.418  ns/op   70.521 ±  0.470  ns/op  (no change)
> VectorSubword.byteToShort    1024  avgt   12  252.647 ±  3.738  ns/op   22.664 ±  0.449  ns/op  (11.1x)
> VectorSubword.charToByte     1024  avgt   12  236.396 ±  3.893  ns/op  228.710 ±  1.967  ns/op  (no change)
> VectorSubword.charToInt      1024  avgt   12  179.673 ±  2.811  ns/op  173.764 ±  1.150  ns/op  (no change)
> VectorSubword.charToLong     1024  avgt   12  184.867 ±  3.079  ns/op  177.999 ±  1.312  ns/op  (no change)
> VectorSubword.charToShort    1024  avgt   12   24.385 ±  1.822  ns/op   22.375 ±  1.980  ns/op  (no change)
> VectorSubword.intToByte      1024  avgt   12  190.949 ±  1.475  ns/op   49.376 ±  1.383  ns/op  (3.86x)
> VectorSubword.intToChar      1024  avgt   12  182.862 ±  3.708  ns/op   44.344 ±  4.513  ns/op  (4.12x)
> VectorSubword.intToLong      1024  avgt   12   76.072 ±  1.153  ns/op   73.382 ±  0.294  ns/op  (no change)
> VectorSubword.intToShort     1024  avgt   12  184.362 ±  1.938  ns/op   45.556 ±  3.323  ns/op  (4.04x)
> VectorSubword.longToByte     1024  avgt   12  150.766 ±  3.475  ns/op  146.651 ±  0.742  ns/op  (no change)
> VectorSubword.longToChar     1024  avgt   12  121.764 ±  1.323  ns/op  117.068 ±  1.891  ns/op  (no change)
> VectorSubword.longToInt      1024  avgt   12   83.761 ±  2.140  ns/op   82.084 ±  0.930  ns/op  (no change)
> VectorSubword.longToShort    1024  avgt   12  132.293 ± 23.046  ns/op  115.883 ±  0.834  ns/op  (+ 12.4%)
> VectorSubword.shortToByte    1024  avgt   12  253.387 ±  5.972  ns/op   27.591 ±  1.311  ns/op  (9.18x)
> VectorSubword.shortToChar    1024  avgt   12   21.446 ±  1.914  ns/op   20.608 ±  1.593  ns/op  (no change)
> VectorSubword.shortToInt     1024  avgt   12  187.109 ±  3.372  ns/op   36.818 ±  0.989  ns/op  (5.08x)
> VectorSubword.shortToLong    1024  avgt   12   75.448 ±  0.930  ns/op   72.835 ±  0.507  ns/op  (no change)
>
> Interestingly, eve...
@jaskarth I ran it with `perf`:
`make test TEST="micro:VectorSubword.charToByte" CONF=linux-x64 TEST_VM_OPTS="-XX:-UseSuperWord" MICRO="OPTIONS=-prof perfasm"`
Without SuperWord, more than 90% of the time is spent in the main loop, which is unrolled 8x:
1.13% ↗│ 0x00007fc0b01d79b0: movslq %r11d,%r14 ;*bastore {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
2.44% ││ 0x00007fc0b01d79b3: movzwl 0x10(%rdi,%r14,2),%r10d ;*caload {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
2.73% ││ 0x00007fc0b01d79b9: mov %r10b,0x10(%r9,%r14,1) ;*bastore {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
5.89% ││ 0x00007fc0b01d79be: movzwl 0x1e(%rdi,%r14,2),%r10d
2.35% ││ 0x00007fc0b01d79c4: movzwl 0x1c(%rdi,%r14,2),%esi
1.51% ││ 0x00007fc0b01d79ca: movzwl 0x1a(%rdi,%r14,2),%r8d
5.34% ││ 0x00007fc0b01d79d0: movzwl 0x18(%rdi,%r14,2),%edx
1.16% ││ 0x00007fc0b01d79d6: movzwl 0x16(%rdi,%r14,2),%ebx
1.77% ││ 0x00007fc0b01d79dc: movzwl 0x14(%rdi,%r14,2),%ebp
2.51% ││ 0x00007fc0b01d79e2: movzwl 0x12(%rdi,%r14,2),%eax ;*caload {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
5.73% ││ 0x00007fc0b01d79e8: mov %al,0x11(%r9,%r14,1)
13.70% ││ 0x00007fc0b01d79ed: mov %bpl,0x12(%r9,%r14,1)
6.72% ││ 0x00007fc0b01d79f2: mov %bl,0x13(%r9,%r14,1)
9.84% ││ 0x00007fc0b01d79f7: mov %dl,0x14(%r9,%r14,1)
6.01% ││ 0x00007fc0b01d79fc: mov %r8b,0x15(%r9,%r14,1)
6.05% ││ 0x00007fc0b01d7a01: mov %sil,0x16(%r9,%r14,1)
5.24% ││ 0x00007fc0b01d7a06: mov %r10b,0x17(%r9,%r14,1) ;*bastore {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
││ ; {other}
11.16% ││ 0x00007fc0b01d7a0b: add $0x8,%r11d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
││ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 23 (line 75)
││ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
0.23% ││ 0x00007fc0b01d7a0f: cmp %ecx,%r11d
╰│ 0x00007fc0b01d7a12: jl 0x00007fc0b01d79b0 ;*goto {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 26 (line 75)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
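For anyone reading the listing without the benchmark source at hand, here is a minimal Java sketch of what the scalar main loop corresponds to. The benchmark body is not shown in this message, so the loop shape is an assumption inferred from the `caload` / `bastore` / `iinc` bytecode annotations above. The class name `CharToByteSketch` and the helper `charToByteUnrolled8` are hypothetical and exist only for illustration.

```java
// Hypothetical reconstruction, not the actual benchmark source.
final class CharToByteSketch {

    // Assumed kernel: one char load (caload) and one narrowing
    // byte store (bastore) per iteration.
    static void charToByte(char[] src, byte[] dst) {
        for (int i = 0; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }

    // Hand-written equivalent of the 8x unrolled main loop in the listing:
    // eight 16-bit loads (movzwl), eight low-byte stores (mov %al, ...),
    // then the induction variable advances by 8 (add $0x8).
    static void charToByteUnrolled8(char[] src, byte[] dst) {
        int i = 0;
        int limit = src.length - 7;
        for (; i < limit; i += 8) {
            dst[i]     = (byte) src[i];
            dst[i + 1] = (byte) src[i + 1];
            dst[i + 2] = (byte) src[i + 2];
            dst[i + 3] = (byte) src[i + 3];
            dst[i + 4] = (byte) src[i + 4];
            dst[i + 5] = (byte) src[i + 5];
            dst[i + 6] = (byte) src[i + 6];
            dst[i + 7] = (byte) src[i + 7];
        }
        // Remaining elements, analogous to a post loop.
        for (; i < src.length; i++) {
            dst[i] = (byte) src[i];
        }
    }
}
```

Both methods produce the same output; the second only makes the structure of the compiled loop visible at the source level.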
With SuperWord this happens:
2.91% ↗ 0x00007f5f801d6d71: mov %r11d,%r8d ;*aload_0 {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 10 (line 76)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
0.78% │ 0x00007f5f801d6d74: vmovd %r8d,%xmm4
0.81% │ 0x00007f5f801d6d79: movslq %r8d,%r14 ;*bastore {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
│ 0x00007f5f801d6d7c: movzwl 0x10(%r13,%r14,2),%r10d ;*caload {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
9.42% │ 0x00007f5f801d6d82: mov %r10b,0x10(%rbp,%r14,1) ;*bastore {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
4.40% │ 0x00007f5f801d6d87: movzwl 0x4e(%r13,%r14,2),%r10d
1.10% │ 0x00007f5f801d6d8d: vmovd %r10d,%xmm3
0.16% │ 0x00007f5f801d6d92: movzwl 0x4c(%r13,%r14,2),%r10d
0.29% │ 0x00007f5f801d6d98: vmovd %r10d,%xmm6
2.14% │ 0x00007f5f801d6d9d: movzwl 0x4a(%r13,%r14,2),%r10d
│ 0x00007f5f801d6da3: vmovd %r10d,%xmm5
2.17% │ 0x00007f5f801d6da8: movzwl 0x48(%r13,%r14,2),%r10d
0.03% │ 0x00007f5f801d6dae: vmovd %r10d,%xmm8
1.91% │ 0x00007f5f801d6db3: movzwl 0x46(%r13,%r14,2),%r10d
│ 0x00007f5f801d6db9: vmovd %r10d,%xmm7
3.76% │ 0x00007f5f801d6dbe: movzwl 0x44(%r13,%r14,2),%r10d ; {other}
0.19% │ 0x00007f5f801d6dc4: vmovd %r10d,%xmm10
1.07% │ 0x00007f5f801d6dc9: movzwl 0x42(%r13,%r14,2),%r10d
│ 0x00007f5f801d6dcf: vmovd %r10d,%xmm9
1.68% │ 0x00007f5f801d6dd4: movzwl 0x40(%r13,%r14,2),%r10d
│ 0x00007f5f801d6dda: vmovd %r10d,%xmm12
1.81% │ 0x00007f5f801d6ddf: movzwl 0x3e(%r13,%r14,2),%r10d
│ 0x00007f5f801d6de5: vmovd %r10d,%xmm11
2.17% │ 0x00007f5f801d6dea: movzwl 0x3c(%r13,%r14,2),%r10d
0.16% │ 0x00007f5f801d6df0: vmovd %r10d,%xmm14
3.37% │ 0x00007f5f801d6df5: movzwl 0x3a(%r13,%r14,2),%r10d
│ 0x00007f5f801d6dfb: vmovd %r10d,%xmm13
1.98% │ 0x00007f5f801d6e00: movzwl 0x38(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e06: vmovd %r10d,%xmm16
1.59% │ 0x00007f5f801d6e0c: movzwl 0x36(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e12: vmovd %r10d,%xmm15
2.49% │ 0x00007f5f801d6e17: movzwl 0x34(%r13,%r14,2),%r10d
0.13% │ 0x00007f5f801d6e1d: vmovd %r10d,%xmm18
1.94% │ 0x00007f5f801d6e23: movzwl 0x32(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e29: vmovd %r10d,%xmm17
2.43% │ 0x00007f5f801d6e2f: movzwl 0x30(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e35: vmovd %r10d,%xmm20
1.62% │ 0x00007f5f801d6e3b: movzwl 0x2e(%r13,%r14,2),%r10d
0.84% │ 0x00007f5f801d6e41: vmovd %r10d,%xmm19
2.36% │ 0x00007f5f801d6e47: movzwl 0x2c(%r13,%r14,2),%r10d
0.06% │ 0x00007f5f801d6e4d: vmovd %r10d,%xmm22
3.17% │ 0x00007f5f801d6e53: movzwl 0x2a(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e59: vmovd %r10d,%xmm21
2.04% │ 0x00007f5f801d6e5f: movzwl 0x28(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e65: vmovd %r10d,%xmm24
1.42% │ 0x00007f5f801d6e6b: movzwl 0x26(%r13,%r14,2),%r10d
│ 0x00007f5f801d6e71: vmovd %r10d,%xmm23
1.72% │ 0x00007f5f801d6e77: movzwl 0x24(%r13,%r14,2),%esi
0.03% │ 0x00007f5f801d6e7d: movzwl 0x22(%r13,%r14,2),%r10d
0.03% │ 0x00007f5f801d6e83: movzwl 0x20(%r13,%r14,2),%r11d
│ 0x00007f5f801d6e89: movzwl 0x1e(%r13,%r14,2),%r9d
│ 0x00007f5f801d6e8f: movzwl 0x1c(%r13,%r14,2),%r8d
0.03% │ 0x00007f5f801d6e95: movzwl 0x1a(%r13,%r14,2),%ebx
0.39% │ 0x00007f5f801d6e9b: movzwl 0x18(%r13,%r14,2),%ecx
│ 0x00007f5f801d6ea1: movzwl 0x16(%r13,%r14,2),%edx
1.85% │ 0x00007f5f801d6ea7: movzwl 0x14(%r13,%r14,2),%edi
│ 0x00007f5f801d6ead: movzwl 0x12(%r13,%r14,2),%eax ;*caload {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 20 (line 76)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
0.06% │ 0x00007f5f801d6eb3: mov %al,0x11(%rbp,%r14,1)
0.03% │ 0x00007f5f801d6eb8: mov %dil,0x12(%rbp,%r14,1)
│ 0x00007f5f801d6ebd: mov %dl,0x13(%rbp,%r14,1)
│ 0x00007f5f801d6ec2: mov %cl,0x14(%rbp,%r14,1) ; {other}
0.49% │ 0x00007f5f801d6ec7: mov %bl,0x15(%rbp,%r14,1)
│ 0x00007f5f801d6ecc: mov %r8b,0x16(%rbp,%r14,1)
1.78% │ 0x00007f5f801d6ed1: mov %r9b,0x17(%rbp,%r14,1)
0.03% │ 0x00007f5f801d6ed6: mov %r11b,0x18(%rbp,%r14,1)
0.03% │ 0x00007f5f801d6edb: mov %r10b,0x19(%rbp,%r14,1)
│ 0x00007f5f801d6ee0: mov %sil,0x1a(%rbp,%r14,1)
1.88% │ 0x00007f5f801d6ee5: vmovd %xmm23,%r10d
│ 0x00007f5f801d6eeb: mov %r10b,0x1b(%rbp,%r14,1)
2.30% │ 0x00007f5f801d6ef0: vmovd %xmm24,%r10d
0.03% │ 0x00007f5f801d6ef6: mov %r10b,0x1c(%rbp,%r14,1)
0.13% │ 0x00007f5f801d6efb: vmovd %xmm21,%r10d
│ 0x00007f5f801d6f01: mov %r10b,0x1d(%rbp,%r14,1)
0.03% │ 0x00007f5f801d6f06: vmovd %xmm22,%r10d
│ 0x00007f5f801d6f0c: mov %r10b,0x1e(%rbp,%r14,1)
│ 0x00007f5f801d6f11: vmovd %xmm19,%r10d
0.03% │ 0x00007f5f801d6f17: mov %r10b,0x1f(%rbp,%r14,1)
1.85% │ 0x00007f5f801d6f1c: vmovd %xmm20,%r10d
│ 0x00007f5f801d6f22: mov %r10b,0x20(%rbp,%r14,1)
0.16% │ 0x00007f5f801d6f27: vmovd %xmm17,%r10d
│ 0x00007f5f801d6f2d: mov %r10b,0x21(%rbp,%r14,1)
0.03% │ 0x00007f5f801d6f32: vmovd %xmm18,%r10d
│ 0x00007f5f801d6f38: mov %r10b,0x22(%rbp,%r14,1)
│ 0x00007f5f801d6f3d: vmovd %xmm15,%r10d
│ 0x00007f5f801d6f42: mov %r10b,0x23(%rbp,%r14,1)
1.98% │ 0x00007f5f801d6f47: vmovd %xmm16,%r10d
│ 0x00007f5f801d6f4d: mov %r10b,0x24(%rbp,%r14,1)
0.13% │ 0x00007f5f801d6f52: vmovd %xmm13,%r10d
│ 0x00007f5f801d6f57: mov %r10b,0x25(%rbp,%r14,1)
0.10% │ 0x00007f5f801d6f5c: vmovd %xmm14,%r10d
│ 0x00007f5f801d6f61: mov %r10b,0x26(%rbp,%r14,1)
│ 0x00007f5f801d6f66: vmovd %xmm11,%r10d
│ 0x00007f5f801d6f6b: mov %r10b,0x27(%rbp,%r14,1)
2.04% │ 0x00007f5f801d6f70: vmovd %xmm12,%r10d
│ 0x00007f5f801d6f75: mov %r10b,0x28(%rbp,%r14,1)
0.13% │ 0x00007f5f801d6f7a: vmovd %xmm9,%r10d
│ 0x00007f5f801d6f7f: mov %r10b,0x29(%rbp,%r14,1)
1.46% │ 0x00007f5f801d6f84: vmovd %xmm10,%r10d
│ 0x00007f5f801d6f89: mov %r10b,0x2a(%rbp,%r14,1)
0.26% │ 0x00007f5f801d6f8e: vmovd %xmm7,%r10d
0.03% │ 0x00007f5f801d6f93: mov %r10b,0x2b(%rbp,%r14,1)
2.20% │ 0x00007f5f801d6f98: vmovd %xmm8,%r10d
0.19% │ 0x00007f5f801d6f9d: mov %r10b,0x2c(%rbp,%r14,1)
2.30% │ 0x00007f5f801d6fa2: vmovd %xmm5,%r10d
0.03% │ 0x00007f5f801d6fa7: mov %r10b,0x2d(%rbp,%r14,1)
1.23% │ 0x00007f5f801d6fac: vmovd %xmm6,%r10d
0.06% │ 0x00007f5f801d6fb1: mov %r10b,0x2e(%rbp,%r14,1)
2.14% │ 0x00007f5f801d6fb6: vmovd %xmm3,%r10d
│ 0x00007f5f801d6fbb: mov %r10b,0x2f(%rbp,%r14,1) ;*bastore {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 22 (line 76)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
2.69% │ 0x00007f5f801d6fc0: vmovd %xmm4,%r11d ; {other}
│ 0x00007f5f801d6fc5: add $0x20,%r11d ;*iinc {reexecute=0 rethrow=0 return_oop=0}
│ ; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 23 (line 75)
│ ; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
0.32% │ 0x00007f5f801d6fc9: cmp 0x34(%rsp),%r11d
╰ 0x00007f5f801d6fce: jl 0x00007f5f801d6d71 ;*goto {reexecute=0 rethrow=0 return_oop=0}
; - org.openjdk.bench.vm.compiler.VectorSubword::charToByte at 26 (line 75)
; - org.openjdk.bench.vm.compiler.jmh_generated.VectorSubword_charToByte_jmhTest::charToByte_avgt_jmhStub at 15 (line 190)
What is happening here?
-------------
PR Comment: https://git.openjdk.org/jdk/pull/23413#issuecomment-2846688923