WIP: auto-vectorization for Graal

Niklas Vangerow nvangerow at twitter.com
Tue Jul 2 15:14:24 UTC 2019


Hello graal-dev,

Twitter has been using Graal in production for several years. As a
Scala shop, we have found that Graal makes it easier for us to create
optimizations targeted at our specific stack. If you’re curious about
how and why we use Graal, check out Chris Thalinger’s talk from Scala
Days New York 2018: https://www.youtube.com/watch?v=PtgKmzgIh4c.
Despite Graal's success at Twitter, one area where it lacks feature
parity with C2 is automatic vectorization. We expect that implementing
it will improve performance for some of our internal customers.

We have therefore decided to implement automatic vectorization
ourselves. A prototype implementation is available at
https://github.com/nvgrw/graal/tree/feature/autovec. We’re continuously
iterating and are open to feedback and suggestions on the design!

Our current approach is largely based on the algorithm described in the
SLP paper (http://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf),
similar to what C2 does. Any basic block containing isomorphic
statements is considered for vectorization, and loops are partially
unrolled to produce straight-line code with isomorphic instructions.
Once the basic algorithm is complete, we hope to improve the
implementation with SLP extensions published in academia and with a
better heuristic. Due to its prevalence, we are currently targeting
only AMD64, but we hope to support all Graal backends in the future.
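
To make the kind of code we target concrete, here is a minimal,
hand-written Java example (illustrative only, not taken from the
prototype or its tests) of a loop that SLP-style packing can vectorize
after partial unrolling:

    class SlpExample {
        // Element-wise addition: after partially unrolling the loop
        // body, adjacent iterations become isomorphic statements on
        // adjacent memory, which is exactly what SLP packing looks for.
        static void addArrays(float[] a, float[] b, float[] c, int n) {
            for (int i = 0; i < n; i++) {
                c[i] = a[i] + b[i];
            }
            // Conceptually, unrolling by 4 turns the body into:
            //     c[i]     = a[i]     + b[i];
            //     c[i + 1] = a[i + 1] + b[i + 1];
            //     c[i + 2] = a[i + 2] + b[i + 2];
            //     c[i + 3] = a[i + 3] + b[i + 3];
            // and packing combines these into vector loads, a vector
            // add and a vector store.
        }
    }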

In addition to hearing the community’s thoughts on our general approach
to the autovectorization problem, we have several outstanding questions
about specific aspects of the current implementation.

We introduced a new packing phase in the LowTier, running just after
the FixReadsPhase, and we are wondering whether this is really the best
place for it. We added dedicated stamps to represent vector packs of
integers and floating-point numbers, and one reason for packing so late
is to avoid having to modify earlier phases to support vectors: some
phases use the JavaKind/JavaConstant types from JVMCI, which don’t
currently support vector types and would probably require several
workarounds to get working.
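
For reference, the skeleton below shows roughly where such a phase
sits. The class name and the step comments are placeholders rather than
the actual prototype code; only BasePhase, LowTierContext,
StructuredGraph and FixReadsPhase are existing Graal types/phases.

    import org.graalvm.compiler.nodes.StructuredGraph;
    import org.graalvm.compiler.phases.BasePhase;
    import org.graalvm.compiler.phases.tiers.LowTierContext;

    // Placeholder name; in our branch the phase is appended to the
    // LowTier suite right after FixReadsPhase.
    public class VectorPackingPhase extends BasePhase<LowTierContext> {
        @Override
        protected void run(StructuredGraph graph, LowTierContext context) {
            // 1. Seed packs from adjacent memory accesses in each block.
            // 2. Extend the packs along use-def chains (SLP).
            // 3. Schedule the packs and replace them with vector nodes
            //    that carry the new vector stamps.
        }
    }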

We also want to partially unroll loops in order to expose more
superword-level parallelism. The LoopPartialUnrollPhase seemed like a
perfect candidate, but it runs after a LoweringPhase that introduces
several new blocks into loop bodies. The unroll phase only unrolls
loops containing fewer than 3 blocks (
https://mail.openjdk.java.net/pipermail/graal-dev/2019-May/005794.html),
which means that loops containing any form of array access do not get
unrolled at all once guards have been lowered. We initially tried to
tackle the problem by adding another loop phase before guards are
lowered, which, unsurprisingly, results in excessive bounds checking.
If you have any ideas for tackling this issue, please let us know!
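
As a hand-written illustration of the shape we are after (Java,
illustrative only, not prototype output): ideally a single range check
would cover each unrolled group of accesses instead of one check per
element, with a scalar tail handling the remainder.

    class UnrollIllustration {
        static void scale(float[] a, float[] c, int n) {
            int limit = Math.min(n, Math.min(a.length, c.length));
            int i = 0;
            // A single trip-count/bounds test covers all four accesses
            // in the group, instead of the per-element guards we
            // currently get when unrolling before guard lowering.
            for (; i + 3 < limit; i += 4) {
                c[i]     = a[i]     * 2f;
                c[i + 1] = a[i + 1] * 2f;
                c[i + 2] = a[i + 2] * 2f;
                c[i + 3] = a[i + 3] * 2f;
            }
            // Scalar tail for the remaining elements.
            for (; i < limit; i++) {
                c[i] = a[i] * 2f;
            }
        }
    }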

Regards,

@nvgrw and the rest of Twitter’s VM team

