Request for reviews (XL): 7119644: Increase superword's vector size up to 256 bits
Vladimir Kozlov
vladimir.kozlov at oracle.com
Mon Apr 9 19:35:52 PDT 2012
Tom Rodriguez wrote:
> On Apr 9, 2012, at 5:54 PM, Vladimir Kozlov wrote:
>
>> Tom Rodriguez wrote:
>>
>> I don't think Vec32B is better than VecY. And I don't want VecQ and VecO, which would be difficult to distinguish.
>
> Except that VecY and VecX are fairly meaningless names.
They are not meaningless; they represent the YMM and XMM CPU registers. The
names are not general, but these operands are currently used only on x86.
>>
>>> be simplified to this:
>>> tempmask.clear_to_sets(lrg.num_regs());
>>> reg = tempmask.find_first_set(lrg.num_regs());
>>> same with the other uses and the Clear/Insert/set_mask_size sequence.
>> Yes, it could be done. I tested it by calling the *_sets() methods from the *_pairs() methods, but I thought it was too aggressive a code change :)
>> If you are OK with such a change, I will do it.
>
> It's aggressive but it should work I think. I could go either way.
OK, I will go with the _sets() methods and the simplified code, and I will
rerun the tests after that.
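For readers following along, the suggested simplification can be illustrated with a toy bitmask model. This is only a sketch of the intended semantics, not HotSpot's actual RegMask class; the method names clear_to_sets and find_first_set come from the discussion above, while ToyMask and everything else here is hypothetical:

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a register mask over 32 registers.
struct ToyMask {
  uint32_t bits;

  // Keep only the aligned groups of 'n' registers that are fully present;
  // this generalizes the old pair-based Clear/Insert sequence to any set size.
  void clear_to_sets(unsigned n) {
    uint32_t result = 0;
    for (unsigned i = 0; i < 32; i += n) {
      uint32_t group = ((n >= 32) ? ~0u : ((1u << n) - 1)) << i;
      if ((bits & group) == group) result |= group;
    }
    bits = result;
  }

  // Index of the first register starting a fully-set aligned group, or -1.
  int find_first_set(unsigned n) const {
    for (unsigned i = 0; i < 32; i += n) {
      uint32_t group = ((n >= 32) ? ~0u : ((1u << n) - 1)) << i;
      if ((bits & group) == group) return (int)i;
    }
    return -1;
  }
};
```

For example, for a mask with bits 0xF6 the 4-register group at index 0 is incomplete and gets cleared, while the full group at index 4 survives, so find_first_set(4) returns 4.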
>
>>> matcher.cpp:
>>> this is kind of gross.
>>> + #if defined(IA32) || defined(AMD64)
>>> + Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
>>> + #else
>>> + 0, Op_RegD, 0, 0, /* Vectors */
>>> + #endif
>>> can't the other platforms be required to use Op_VecS instead?
>> You mean Op_VecD? The reason I did it this way is to reduce code changes in the .ad files on other platforms, which will not get the new functionality. There is also code in the RA for pair misalignment on SPARC which I don't want to duplicate for vectors.
>
> Could those be filled in by asking the ad file instead?
I can move this whole table initialization into the .ad file. It works.
>>
>>> Why aren't you doing copy elimination on vectors?
>> What do you mean? All the changes in postaloc.cpp were made so that it works for vectors. I verified that it eliminates MSC (MachSpillCopy) nodes where possible. Do you think I missed something?
>
> The block to update the mapping doesn't have the eliminate_copy_of_constant logic or the replace_and_yank_if_dead logic.
>
A vector value can't be a constant (there is no type that represents it); the
construction of a vector from a constant value is hidden in the Mach replicate
node. But you are right about the case when (n->is_Copy() && value[nreg] == val):
I need to call replace_and_yank_if_dead() in that case.
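As a rough illustration of the copy-elimination pattern under discussion: if a copy writes a value that the destination register already holds, the copy is redundant and can be yanked. This is only a sketch of the idea; the real logic lives in HotSpot's post-allocation pass, and the Node struct and function name eliminate_redundant_copies here are simplified inventions:

```cpp
#include <cassert>
#include <vector>

// Minimal model: each "node" defines a value id into a register.
struct Node {
  int def_reg;    // register this node writes
  int value_id;   // abstract identity of the value produced
  bool is_copy;   // register-to-register copy?
  bool dead = false;
};

// If a copy's destination register already holds the copied value,
// mark the copy dead (the replace_and_yank_if_dead idea); otherwise
// record the new value in the reg -> value mapping.
static void eliminate_redundant_copies(std::vector<Node>& block,
                                       std::vector<int>& value) {
  for (Node& n : block) {
    if (n.is_copy && value[n.def_reg] == n.value_id) {
      n.dead = true;  // uses would be rewired to the surviving def
      continue;
    }
    value[n.def_reg] = n.value_id;
  }
}
```

With this model, the first copy of a value into a register is kept (it changes the mapping), while a second identical copy into the same register is marked dead.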
>>
>> 0a0 B12: # B12 B13 <- B11 B12 Loop: B12-B12 inner main of N90 Freq: 9340.37
>> 0a0 movslq R10, R11 # i2l
>> 0a3 movdqu [RSI + #16 + R10 << #2],XMM0 ! store vector (16 bytes)
>> 0aa movq [RDX + #16 + R10],XMM1 ! store vector (8 bytes)
>> 0b1 movslq R10, R11 # i2l
>> 0b4 movdqu [RSI + #32 + R10 << #2],XMM0 ! store vector (16 bytes)
>> 0bb addl R11, #8 # int
>> 0bf cmpl R11, R8
>> 0c2 jl,s B12 # loop end P=0.999893 C=46701.000000
>>
>> The previous code vectorized only one type in such a case. The new code collects, during one iteration, only related memory operations (as before these changes), then removes those operations from the memops list and tries to collect other related memory operations. Such vectors need different loop index alignments since the vector sizes differ. The new code sets the maximum loop index alignment; the maximum also works for the smaller sizes since all sizes are powers of 2.
>
> How does it deal with index variables that might be offset? Something like this:
>
>> static void test_IBci(int[] a, byte[] b) {
>> for (int i = 0; i < a.length - 1; i+=1) {
>> a[i] = -456;
>> b[i + 1] = -(byte)123;
>> }
>> }
>
>
> It's not obvious to me where that will be weeded out.
0b3 movq [RDX + #17 + R10],XMM1 ! store vector (8 bytes)
It generates an unaligned move. On x86 this does not matter (I used only
unaligned asm instructions), but on SPARC it is a disaster (40 times slower,
since it traps):
IBci: 2382
IBvi: 65
I will need to add a check (current alignment vs. max alignment) in
find_adjacent_refs to vectorize only aligned memory operations on SPARC. I want
to keep unaligned memory operations on x86 since we can win on the vector
arithmetic.
Vladimir
>
> tom
>
>>> find_align_to_ref shouldn't be setting _align_to_ref directly. It should be returning the values it finds and the caller should be responsible for saving off the proper value.
>> Done.
>>
>>> I like the simplification of the loads and stores but I end up wondering why the rest of the operations aren't similarly simplified.
>> They could be, but they would need different formats since they use different asm instructions for different basic type sizes. But I can definitely merge ReplicateS and ReplicateC, which have the same element size.
>
> Right, but that's the same as LoadVector and StoreVector.
>
>>> Overall the code looks good.
>>> tom
>> Thanks,
>> Vladimir
>>
>>> On Apr 3, 2012, at 10:03 AM, Vladimir Kozlov wrote:
>>>> http://cr.openjdk.java.net/~kvn/7119644/webrev
>>>>
>>>> 7119644: Increase superword's vector size up to 256 bits
>>>>
>>>> Increase superword's vector size up to 256 bits for the YMM AVX registers on x86. Added generation of different vector sizes for different types of arrays in the same loop. Allow generating small (4-byte) vectors for loops that were unrolled only a small number of iterations.
>>>> Add new C2 types for vectors and rework the VectorNode implementation. MachTypeNode is used as the base node for vector mach nodes to keep the vector type.
>>>> Moved the XMM register definitions and the vector instructions into one file, x86.ad (this required renaming eRegI to rRegI in x86_32.ad).
>>>>
>>>> Tested with full CTW, NSK, C2 regression tests, JPRT and added new test.
>>>>
>>>> Thanks,
>>>> Vladimir
>
More information about the hotspot-compiler-dev mailing list