SuperWord optimization
Tom Rodriguez
Thomas.Rodriguez at Sun.COM
Fri Feb 27 10:16:47 PST 2009
So somehow I blew away the original version of this though I still had
the pieces that were used to make it, so I spent some time recreating
it. I haven't finished all the ad file changes but it's working and
spilling 128 bit values into properly aligned stack slots. There's a
new ideal type RegQ for quads and a TypeQuad to go along with it. The
XMM registers are properly represented at 128 bits wide and the
regmasks have been modified to support aligned 128 bit registers. The
heap alignment issues still aren't resolved so for the moment I'm
using the unaligned variants for heap access to allow the code to
run. It could run with the aligned ones on Nehalem since unaligned
memory accesses are fast there. The stack accesses are still using
the aligned versions though. The ad file changes are mostly
incomplete but they are reasonably straightforward. x86_32.ad is the
only one which is even vaguely complete. The most complex changes
are in the register allocator and type system and those are
mostly the way they should be.
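
To make the "aligned 128 bit registers" and "properly aligned stack slots" points concrete, here is a minimal sketch. It is not the webrev's code. It assumes a register mask modeled as a 64-bit word with one bit per 32-bit slot, so a quad needs four adjacent free bits starting at a multiple of four. The names find_first_aligned_quad and align_up are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: one bit per 32-bit register slot, set = free.
// A 128-bit value needs four adjacent free slots whose first index is a
// multiple of 4. Returns that index, or -1 if no aligned group is free.
int find_first_aligned_quad(std::uint64_t free_mask) {
    for (int i = 0; i < 64; i += 4) {
        if (((free_mask >> i) & 0xFu) == 0xFu) {
            return i;
        }
    }
    return -1;
}

// Round a stack offset up to the next multiple of `alignment`, which must
// be a power of two (16 for a 128-bit spill slot).
std::size_t align_up(std::size_t offset, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

In this model a mask with bits 2..5 free is rejected even though four adjacent slots are available, because the run does not start on a quad boundary.
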
One problem with superword that these changes highlight is that it
doesn't really support multiple vector sizes. Currently you have to
pick the size up front and all the code generation has to reflect that
choice. To really add 128 bit vector operations I think superword has
to be changed to support multiple vector sizes. Otherwise we may miss
opportunities with smaller vectors or smaller arrays where the
overhead of the larger vector would be greater. This would mean that
the vector nodes would have to be told their size when they were
created and the ad file would decide how big the vector operation was
based on the ideal reg type used.
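
One way to picture "vector nodes told their size when they are created" is the sketch below. It is an illustration under assumptions, not superword's design: the VecWidth enum, the VectorOp struct, and the trip-count policy in make_vector_op are all hypothetical, with W128 standing in for the RegQ case.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical vector widths in bytes; W128 corresponds to a quad.
enum class VecWidth : std::size_t { W64 = 8, W128 = 16 };

// A vector operation that carries its width from creation onward, so later
// stages can pick the instruction from the width instead of a global setting.
struct VectorOp {
    VecWidth width;
    std::size_t elem_size;  // element size in bytes
    std::size_t lanes() const {
        return static_cast<std::size_t>(width) / elem_size;
    }
};

// Hypothetical policy: use 128 bits only when the known trip count can fill
// every lane at least once, otherwise fall back to 64 bits.
VectorOp make_vector_op(std::size_t elem_size, std::size_t trip_count) {
    assert(elem_size == 1 || elem_size == 2 || elem_size == 4 || elem_size == 8);
    const std::size_t wide_lanes = 16 / elem_size;
    const VecWidth w = (wide_lanes <= trip_count) ? VecWidth::W128 : VecWidth::W64;
    return VectorOp{w, elem_size};
}
```

With 4-byte elements, a loop of three iterations would get a 64-bit op with two lanes, while a loop of four or more would get the 128-bit op. A real policy would weigh more than trip count.
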
As for the alignment issues, on platforms like Nehalem where
unaligned accesses are as fast as aligned we could simply skip that
step. On other platforms we'd need to adjust the alignment preloop to
be address based instead of index based. I don't think this is too
hard. I guess I'm not holding my breath for 16 byte alignment in the
GC.
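
The difference between index based and address based alignment can be sketched as follows. This is a rough illustration, not the preloop code in superword. Both function names are hypothetical, and the addresses in the note after the code are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Index based: assumes element 0 already sits on an `align_bytes` boundary,
// so only the starting index matters.
std::size_t pre_loop_iterations_index_based(std::size_t first_index,
                                            std::size_t elem_size,
                                            std::size_t align_bytes) {
    assert(elem_size != 0 && align_bytes % elem_size == 0);
    const std::size_t lanes = align_bytes / elem_size;
    return (lanes - first_index % lanes) % lanes;
}

// Address based: uses the actual address of the first element accessed, so
// it stays correct when the array data is only 8-byte aligned.
std::size_t pre_loop_iterations_address_based(std::uintptr_t base_addr,
                                              std::size_t first_index,
                                              std::size_t elem_size,
                                              std::size_t align_bytes) {
    assert(align_bytes != 0 && (align_bytes & (align_bytes - 1)) == 0);
    assert(elem_size != 0 && align_bytes % elem_size == 0);
    const std::uintptr_t addr = base_addr + first_index * elem_size;
    const std::size_t misalign = addr & (align_bytes - 1);
    // An address that is not a multiple of elem_size can never be aligned
    // by stepping whole elements.
    assert(misalign % elem_size == 0);
    return ((align_bytes - misalign) & (align_bytes - 1)) / elem_size;
}
```

For 4-byte elements starting at index 0 with the data at 0x1008, the index based count is 0 and the first vector access would be misaligned for 16 bytes. The address based count is 2, which lands the main loop on 0x1010.
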
No one is actually working on this code at the moment, though it's
possible the quad support might go in separately.
The full webrev includes support for all the vector math ops but it
has only been lightly tested on x86_32. The vector math op support is
pretty much exactly what Ross left behind and the rest is all me.
Anyway, the very large and very rough webrev is at http://cr.openjdk.java.net/~never/quad
for those interested.
tom
On Jan 5, 2009, at 12:57 PM, John Rose wrote:
> I'd love to see that webrev. (You've got some great back-burner
> stuff, Tom!)
>
> An alternative to adding pointer alignment logic (in the compiler
> and also the hand-tuned assembly code) is allocating all arrays
> above some fixed length (e.g., 10 words) with an additional
> alignment constraint.
>
> This has the advantage of making any pair of large-enough arrays
> mutually aligned, if their access indexes are aligned. The compiler
> and assembly stubs already have special paths for very short arrays;
> these cut-outs could be adapted to take into account the strong
> alignment size. Or we could just unswitch the whole loop (or
> predicate it with an uncommon trap, given the right length profile
> data).
>
> -- John
>
> P.S. In general, if there's an optimization that makes sense mainly
> for large arrays, the JVM has the option of allocating large arrays
> with special tactics (alignment, chunking, multianewarray layout,
> cache line coloring, CPU affinity for work-stealing, etc., etc.).
> This option is not yet exercised, except in the simple case of slow-
> pathing truly huge arrays, bigger than FastAllocateSizeLimit. This
> class of optimizations gets more valuable as CPU-to-memory distances
> increase, but it is still waiting for the right PhD student to come
> along.
>
>
> On Jan 5, 2009, at 11:39 AM, Tom Rodriguez wrote:
>
>> I have a workspace that combines Ross's initial work supporting
>> arbitrary ALU operations for vectorization with some work I did for
>> adding a new RegQ type for 128-bit vector operations. It was
>> limping along when I stopped looking at it but I could post a
>> webrev of it if there is interest. It's not something I'm working
>> on but I'd wanted to bring the parts forward so they didn't get
>> completely lost. All the code generation pieces were working and
>> the main remaining piece is fixing the loop alignment code to
>> support aligning to 128 bit boundaries. The heap is only aligned
>> to 64-bits so the alignment code that superword uses needs to
>> switch to a pointer alignment calculation instead of using index
>> alignment.
>>
>> Once we add 128 bit vectors we'll need some policy work to choose
>> vector sizes based on the code we see. Currently the code assumes
>> there is only one vector register size.