SuperWord optimization

Fri Feb 27 10:16:47 PST 2009

So somehow I blew away the original version of this though I still had  
the pieces that were used to make it, so I spent some time recreating  
it.  I haven't finished all the ad file changes but it's working and  
spilling 128 bit values into properly aligned stack slots.  There's a  
new ideal type RegQ for quads and a TypeQuad to go along with it.  The  
XMM registers are properly represented at 128 bits wide and the  
regmasks have been modified to support aligned 128 bit registers.  The  
heap alignment issues still aren't resolved so for the moment I'm
using the unaligned variants for heap access to allow the code to  
run.  It could run with the aligned ones on Nehalem since unaligned  
memory accesses are fast there.  The stack accesses are still using  
the aligned versions though.  The ad file changes are mostly  
incomplete but they are reasonably straightforward.  x86_32.ad is the  
only one which is even vaguely complete.  The most complex set of
changes is in the register allocator and type system and those are
mostly the way they should be.

One problem with superword that these changes highlight is that it  
doesn't really support multiple vector sizes.  Currently you have to  
pick the size up front and all the code generation has to reflect that  
choice.  To really add 128 bit vector operations I think superword has  
to be changed to support multiple vector sizes.  Otherwise we may miss  
opportunities with smaller vectors or smaller arrays where the  
overhead of the larger vector would be greater.  This would mean that  
the vector nodes would have to be told their size when they were  
created and the ad file would decide how big the vector operation was  
based on the ideal reg type used.
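
To make the per-node size idea a bit more concrete, here is a rough sketch of the kind of policy I have in mind.  None of this is code from the webrev: the IdealReg enum, VectorChoice and choose_vector are hypothetical names, and the "trip count must cover at least one full vector" rule is just an assumed heuristic for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-ins for ideal register kinds. RegQ mirrors the new
// 128-bit quad type; Reg64 stands for whatever 64-bit vector register the
// existing code uses. These are NOT the real HotSpot definitions.
enum class IdealReg { Scalar, Reg64, RegQ };

struct VectorChoice {
    IdealReg    reg;    // register kind the vector node would be created with
    std::size_t lanes;  // number of elements per vector operation
};

// Assumed policy: prefer the widest vector that holds at least two elements
// and that the loop can fill at least once; otherwise fall back to scalar.
inline VectorChoice choose_vector(std::size_t elem_size, std::size_t trip_count) {
    const std::size_t widths[] = {16, 8};  // candidate vector widths in bytes
    for (std::size_t w : widths) {
        std::size_t lanes = w / elem_size;
        if (lanes >= 2 && trip_count >= lanes) {
            return { w == 16 ? IdealReg::RegQ : IdealReg::Reg64, lanes };
        }
    }
    return { IdealReg::Scalar, 1 };
}
```

With this, a long loop over 4-byte elements would get RegQ with 4 lanes, while a 3-iteration loop over the same elements would drop to the 64-bit size with 2 lanes.  The ad file would then only need to match on the register kind.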

As for the alignment issues, on platforms like Nehalem where
unaligned accesses are as fast as aligned we could simply skip that  
step.  On other platforms we'd need to adjust the alignment preloop to  
be address based instead of index based.  I don't think this is too  
hard.  I guess I'm not holding my breath for 16 byte alignment in the  
GC.
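
For what it's worth, the address-based calculation would look roughly like the sketch below.  This is only an illustration under my own assumptions, not the superword code: pre_loop_iterations is a hypothetical helper, and it takes the raw address of element 0 rather than an index.  The point is that with a heap aligned only to 8 bytes, element 0 can sit at an address that is 8 mod 16, so an index-based count can't tell how many scalar iterations are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: number of scalar pre-loop iterations to run before
// the address (base + i * elem_size) is aligned to vec_bytes.
// Returns -1 if no whole number of elements reaches an aligned address.
inline std::ptrdiff_t pre_loop_iterations(std::uintptr_t base,
                                          std::size_t elem_size,
                                          std::size_t vec_bytes) {
    // The mask trick below requires a power-of-two vector width.
    assert(vec_bytes != 0 && (vec_bytes & (vec_bytes - 1)) == 0);
    assert(elem_size != 0);

    std::size_t misalign = static_cast<std::size_t>(base & (vec_bytes - 1));
    if (misalign == 0) {
        return 0;  // already aligned, no pre-loop needed
    }
    std::size_t bytes_needed = vec_bytes - misalign;
    if (bytes_needed % elem_size != 0) {
        return -1;  // alignment is unreachable by stepping whole elements
    }
    return static_cast<std::ptrdiff_t>(bytes_needed / elem_size);
}
```

So an int array whose first element is at 0x1008 needs 2 pre-loop iterations to reach a 16 byte boundary, while the same array at 0x1000 needs none.  In the compiler the base address is only known at runtime, so this would have to be emitted as code in the pre-loop limit calculation rather than folded as a constant.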

No one is actually working on this code at the moment, though it's
possible the quad support might go in separately.

The full webrev includes support for all the vector math ops but it  
has only been lightly tested on x86_32.  The vector math op support is  
pretty much exactly what Ross left behind and the rest is all me.   
Anyway, the very large and very rough webrev is at
http://cr.openjdk.java.net/~never/quad for those interested.

tom

On Jan 5, 2009, at 12:57 PM, John Rose wrote:

> I'd love to see that webrev.  (You've got some great back-burner  
> stuff, Tom!)
>
> An alternative to adding pointer alignment logic (in the compiler  
> and also the hand-tuned assembly code) is allocating all arrays  
> above some fixed length (e.g., 10 words) with an additional  
> alignment constraint.
>
> This has the advantage of making any pair of large-enough arrays  
> mutually aligned, if their access indexes are aligned.  The compiler  
> and assembly stubs already have special paths for very short arrays;  
> these cut-outs could be adapted to take into account the strong  
> alignment size.  Or we could just unswitch the whole loop (or
> predicate it with an uncommon trap, given the right length profile  
> data).
>
> -- John
>
> P.S.  In general, if there's an optimization that makes sense mainly  
> for large arrays, the JVM has the option of allocating large arrays  
> with special tactics (alignment, chunking, multianewarray layout,  
> cache line coloring, CPU affinity for work-stealing, etc., etc.).   
> This option is not yet exercised, except in the simple case of slow- 
> pathing truly huge arrays, bigger than FastAllocateSizeLimit.  This  
> class of optimizations gets more valuable as CPU-to-memory distances  
> increase, but it is still waiting for the right PhD student to come  
> along.
>
>
> On Jan 5, 2009, at 11:39 AM, Tom Rodriguez wrote:
>
>> I have a workspace that combines Ross's initial work supporting  
>> arbitrary ALU operations for vectorization with some work I did for  
>> adding a new RegQ type for 128-bit vector operations.  It was  
>> limping along when I stopped looking at it but I could post a  
>> webrev of it if there is interest.  It's not something I'm working  
>> on but I'd wanted to bring the parts forward so they didn't get  
>> completely lost.  All the code generation pieces were working and
>> the main remaining piece is fixing the loop alignment code to  
>> support aligning to 128 bit boundaries.  The heap is only aligned  
>> to 64-bits so the alignment code that superword uses needs to  
>> switch to a pointer alignment calculation instead of using index  
>> alignment.
>>
>> Once we add 128 bit vectors we'll need some policy work to choose  
>> vector sizes based on the code we see.  Currently the code assumes  
>> there is only one vector register size.