RFR: 8291809: Convert compiler/c2/cr7200264/TestSSE2IntVect.java to IR verification test [v2]

Thu Jan 25 14:19:38 UTC 2024

On Thu, 25 Jan 2024 13:38:20 GMT, Daniel Lundén <dlunden at openjdk.org> wrote:

>> I just checked on my machine (on top of commit fb822e49f2a84423c8fd17db2e95bbdd5e7ec191) and these division tests do seem to vectorize. For example, this is the innermost loop in `test_divc` right before code emission:
>> 
>> ![test_divc](https://github.com/openjdk/jdk/assets/8792647/129d51c2-a1ad-4d02-ab81-02cd849af36f)
>> 
>> Here are my processor features in case it helps (subset of `lscpu` output):
>> 
>> 
>> Architecture:            x86_64
>>   CPU op-mode(s):        32-bit, 64-bit
>>   Address sizes:         39 bits physical, 48 bits virtual
>>   Byte Order:            Little Endian
>> CPU(s):                  12
>>   On-line CPU(s) list:   0-11
>> Vendor ID:               GenuineIntel
>>   Model name:            Intel(R) Core(TM) i7-9850H CPU @ 2.60GHz
>>     CPU family:          6
>>     Model:               158
>>     Thread(s) per core:  2
>>     Core(s) per socket:  6
>> (...)
>>     Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
>>                          a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss 
>>                          ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art
>>                           arch_perfmon pebs bts rep_good nopl xtopology nonstop_
>>                          tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cp
>>                          l vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid ss
>>                          e4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes 
>>                          xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_f
>>                          ault epb invpcid_single ssbd ibrs ibpb stibp ibrs_enhan
>>                          ced tpr_shadow flexpriority ept vpid ept_ad fsgsbase ts
>>                          c_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed ad
>>                          x smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsav
>>                          es dtherm ida arat pln pts hwp hwp_notify hwp_act_windo
>>                          w hwp_epp vnmi md_clear flush_l1d arch_capabilities
>> (...)
>
> Thanks for the clarification @robcasloz and @chhagedorn. I've now investigated, and the tests do vectorize on my machine as well. I was confused because, before the change below, the IR framework did not match the nodes: the vectors have size 4, not the default expected size of 8. Is that expected, and should we specify something else instead of the catch-all `IRNode.VECTOR_SIZE_ANY`?
> 
> 
>      @Test
> -    @IR(counts = { IRNode.ADD_VI,    "> 0",
> -                   IRNode.RSHIFT_VI, "> 0",
> -                   IRNode.SUB_VI,    "> 0" },
> +    @IR(counts = { IRNode.ADD_VI,    IRNode.VECTOR_SIZE_ANY, "> 0",
> +                   IRNode.RSHIFT_VI, IRNode.VECTOR_SIZE_ANY, "> 0",
> +                   IRNode.SUB_VI,    IRNode.VECTOR_SIZE_ANY, "> 0" },
>          applyIfCPUFeatureOr = {"sse2", "true", "asimd", "true"})
>      void test_divc(int[] a0, int[] a1) {
>          for (int i = 0; i < a0.length; i+=1) {
> @@ -519,9 +519,9 @@ void test_divc(int[] a0, int[] a1) {
>      }
>  
>      @Test
> -    @IR(counts = { IRNode.ADD_VI,    "> 0",
> -                   IRNode.RSHIFT_VI, "> 0",
> -                   IRNode.SUB_VI,    "> 0" },
> +    @IR(counts = { IRNode.ADD_VI,    IRNode.VECTOR_SIZE_ANY, "> 0",
> +                   IRNode.RSHIFT_VI, IRNode.VECTOR_SIZE_ANY, "> 0",
> +                   IRNode.SUB_VI,    IRNode.VECTOR_SIZE_ANY, "> 0" },
>          applyIfCPUFeatureOr = {"sse2", "true", "asimd", "true"})
>      void test_divc_n(int[] a0, int[] a1) {
>          for (int i = 0; i < a0.length; i+=1) {

> @dlunde, do you understand what factors determine the length of the vector? Why is the default of `IRNode.VECTOR_SIZE_MAX` not working?

Perhaps C2 hits the loop unrolling limit before the loop is unrolled far enough to fill a maximum-size vector? @dlunde, you can test this by running with a large value for `-XX:LoopUnrollLimit`. Even if that turns out to be the case, I would still suggest using `IRNode.VECTOR_SIZE_ANY` rather than forcing a higher loop unroll limit for the tests.
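
For experimenting with the unroll limit outside the IR framework, a stand-alone reproducer along these lines might be enough. This is only a sketch: the class name `DivcSketch`, the divisor value, the array size, and the warm-up count are all assumptions, and the loop body is a guess at a division-by-constant kernel, since the body of `test_divc` is not shown in this thread.

```java
import java.util.Arrays;

// Hypothetical stand-alone reproducer; NOT part of TestSSE2IntVect.java.
public class DivcSketch {
    // Assumed constant divisor; the value used by the real test is not shown here.
    static final int DIVISOR = 4;

    // Same loop header as in the diff above; the body is an assumption.
    static void divc(int[] a0, int[] a1) {
        for (int i = 0; i < a0.length; i += 1) {
            a0[i] = a1[i] / DIVISOR;
        }
    }

    public static void main(String[] args) {
        // Arbitrary input with both negative and positive values.
        int[] a1 = new int[1024];
        for (int i = 0; i < a1.length; i++) {
            a1[i] = i - 512;
        }
        int[] a0 = new int[a1.length];

        // Call the kernel repeatedly so that the JIT has a chance to compile it.
        for (int iter = 0; iter < 20_000; iter++) {
            divc(a0, a1);
        }

        // Print a few results so the work cannot be treated as dead code.
        System.out.println(Arrays.toString(Arrays.copyOf(a0, 8)));
    }
}
```

Running it once with the default settings and once with `java -XX:LoopUnrollLimit=<large value> DivcSketch.java`, then comparing the compiled loop with whatever inspection tooling is at hand, should show whether the vector length changes with the limit.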

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/17428#discussion_r1466441530