Request for reviews (XL): 7119644: Increase superword's vector size up to 256 bits

Mon Apr 9 17:54:36 PDT 2012

Tom Rodriguez wrote:
> VecX and VecY aren't great names.  Why not do Op_Vec{4,8,16,32} for these instead?  Maybe it needs to a trailing identifier like B to indicate the sizing?

I known, but code in adlc is a lot simpler this way. For example, in archDesc.cpp:

   // Match Vector types.
   if (strncmp(idealOp, "Vec",3)==0) {
     switch(last_char) {
     case 'S':  return "TypeVect::VECTS";
     case 'D':  return "TypeVect::VECTD";
     case 'X':  return "TypeVect::VECTX";
     case 'Y':  return "TypeVect::VECTY";
     default:
       internal_err("Vector type %s with unrecognized type\n",idealOp);
     }
   }

I don't think Vec32B is better then VecY. And I don't want VecQ and VecO which 
are difficult to distinguish.

> 
> sparc.ad:
> 
> apart from moving the code around and the new predicates, did the instruction definitions change in any other way?

I did not change existed Replicate instructions. I changed a little to simplify 
replicate_immI() code which constructs an initialization vector placed into 
constant table. I also added replication for Float values and replaced type 
specific Load instruction with one loadV8 instruction.

> 
> x86.ad:
> 
> Please make sure all that  %{ are on the preceding lines instead of by themselves.

Fixed (Vec* operands had it).

> 
> Is is possible to share the new spill copy code in here instead of duplicating it?

Yes, it looks like possible (just tried). The stack-stack spills are different 
but xmm-xmm and xmm-stack could be moved.

> 
> chaitin.cpp:
> 
> Why do we need the machine dependent ShouldNotReachHeres?  Aren't the asserts enough?

I removed them.

> 
> Can't this:
> 
>         if( lrg.num_regs() == 1 ) {
>           reg = tempmask.find_first_elem();
> +       } else if (lrg._is_vector) {
> +         tempmask.clear_to_sets(lrg.num_regs());
> +         reg = tempmask.find_first_set(lrg.num_regs());
>         } else {
> !         tempmask.ClearToPairs();
> !         tempmask.clear_to_pairs();
>           reg = tempmask.find_first_pair();
>         }
> 
> be simplified to this:
> 
>         tempmask.clear_to_sets(lrg.num_regs());
>         reg = tempmask.find_first_set(lrg.num_regs());
> 
> same with the other uses and the Clear/Insert/set_mask_size sequence.

Yes, it could be done. I tested it by calling *_sets() methods from *_pairs() 
methods. But I thought it is too aggressive code change :)
If you OK with such change I will do it.

> 
> matcher.cpp:
> 
> this is kind of gross.
> 
> + #if defined(IA32) || defined(AMD64)
> +   Op_VecS, Op_VecD, Op_VecX, Op_VecY, /* Vectors */
> + #else
> +   0, Op_RegD, 0, 0, /* Vectors */
> + #endif
> 
> can't the other platforms be required to use Op_VecS instead?

You mean Op_VecD? The reason I did this way is to reduce code changes in .ad 
files on other platforms which will not add new functionality. There is also 
code in RA for pairs misalignment on SPARC which I don't want to duplicate for 
vectors.

> 
> postaloc.cpp:
> 
> Don't you need to generalize this for n_reg > 2:
> 
>     // See if it happens to already be in the correct register!
>     // (either Phi's direct register, or the common case of the name
>     // never-clobbered original-def register)
> !   if( value[val_reg] == val &&
> !   if (value[val_reg] == val &&
>         // Doubles check both halves
> !       ( single || value[val_reg-1] == val ) ) {
> !       ((n_regs == 1) || value[val_reg-1] == val)) {
>       blk_adjust += use_prior_register(n,k,regnd[val_reg],current_block,value,regnd);
>       if( n->in(k) == regnd[val_reg] ) // Success!  Quit trying
>         return blk_adjust;
>     }

Yes, I cut a corner here. I will add correct check for all vector's value[] == val.

> 
> Why aren't you doing copy elimination on vectors?

What do you mean? All changes in postaloc.cpp are done for that to work for 
vectors. I verified that it eliminates MSC where possible. Do you think I missed 
something?

> 
> regmask.cpp:
> 
> the logic in num_registers is too clever.  Please make it return the appropriate values directly.

Done.

> 
> superword.cpp:
> 
> You have an extra copy of the s2 line.
> 
>     set_alignment(s2, align + data_size(s1));
> +   if (align == top_align || align == bottom_align )
> +     set_alignment(s2, align);
> +   else {
> +     set_alignment(s2, align + data_size(s1));
> +   }

Fixed.

> 
> I'm not sure I understand/trust the new logic in find_adjacent_refs.  It's iterating over all the candidate memory ops, collecting sets at potentially different alignments and then setting the overall alignment?  Somehow I don't think this logic will work in the general case or at least it has some hidden limitations.  Can you explain this part more?

New code is added to vectorize operations which have different basic types:

   static void test_IBci(int[] a, byte[] b) {
     for (int i = 0; i < a.length; i+=1) {
       a[i] = -456;
       b[i] = -(byte)123;
     }
   }

0a0   B12: #	B12 B13 <- B11 B12 	Loop: B12-B12 inner main of N90 Freq: 9340.37
0a0   	movslq  R10, R11	# i2l
0a3   	movdqu  [RSI + #16 + R10 << #2],XMM0	! store vector (16 bytes)
0aa   	movq    [RDX + #16 + R10],XMM1	! store vector (8 bytes)
0b1   	movslq  R10, R11	# i2l
0b4   	movdqu  [RSI + #32 + R10 << #2],XMM0	! store vector (16 bytes)
0bb   	addl    R11, #8	# int
0bf   	cmpl    R11, R8
0c2   	jl,s   B12	# loop end  P=0.999893 C=46701.000000

previous code vectorized only one type in such case. New code collects during 
one iteration only related memory operations (as before these changes). Then it 
removes these operations from memops list and tries to collect other related mem 
ops. Such vectors need different loop index alignment since vector sizes are 
different. The new code set maximum loop index alignment. Max alignment also 
works for smaller sizes since sizes are power of 2.

> 
> find_align_to_ref shouldn't be setting _align_to_ref directly.  It should be returning the values it finds and the caller should be responsible for saving off the proper value.

Done.

> 
> I like the simplification of the loads and stores but I end up wondering why the rest of the operations aren't similarly simplified.

They could, but they would need different formats since they use different asm 
instructions for different basic type sizes. But I can definitely merge 
ReplicateS and ReplicateC which have the same size.

> 
> Overall the code looks good.
> 
> tom

Thanks,
Vladimir

> 
> 
> On Apr 3, 2012, at 10:03 AM, Vladimir Kozlov wrote:
> 
>> http://cr.openjdk.java.net/~kvn/7119644/webrev
>>
>> 7119644: Increase superword's vector size up to 256 bits
>>
>> Increase superword's vector size up to 256-bits for YMM AVX registers on x86. Added generation of different vector sizes for different types of arrays in the same loop. Allow to generate small (4 bytes) vectors for loops which were unrolled small number of iterations.
>> Add new C2 types for vectors and rework VectorNode implementation. Used MachTypeNode as base node for vector mach nodes to keep vector type.
>> Moved XMM registers definition and vector instructions into one file x86.ad (have to rename eRegI to rRegI in x86_32.ad).
>>
>> Tested with full CTW, NSK, C2 regression tests, JPRT and added new test.
>>
>> Thanks,
>> Vladimir
>