WIP: auto-vectorization for Graal

Thomas Wuerthinger thomas.wuerthinger at oracle.com
Sat Jul 6 17:01:24 UTC 2019


Hi Niklas,

Thanks for sharing your prototype! Are there any measurements of Twitter’s
workloads that improve with this patch? Generally, given Twitter’s large
production deployment of the GraalVM compiler, one main contribution from
the #TwitterVMTeam side could be sharing JFR data and/or compiler graphs
from test runs.

For larger additions of general optimisations to the GraalVM compiler, there
needs to be an assessment of:
* Peak performance impact
* Compilation time impact
* Compiled code size impact
* Overall complexity and maintenance overhead

Our preference on the first three metrics is to use https://renaissance.dev,
https://dacapobench.org, and for reference also SPECjvm2008. Our latest 19.1
release shows how much impact compilation time can have on medium-length
workloads (https://medium.com/graalvm/graalvm-19-1-compiling-faster-a0041066dee4).
One relevant question on the last point is whether this is intended as a one-time
contribution or whether #TwitterVMTeam is committed to helping maintain
the optimisation going forward.

Regarding the general approach: Yes, removing bounds checks is a general
prerequisite for good vectorisation opportunities. I think there should probably be
a lightweight array bounds check elimination algorithm that targets array accesses
in loops. This alone might have quite a positive impact on array-heavy workloads.
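To make the point concrete, here is a small illustrative Java sketch (plain source code, not Graal IR; the class and method names are made up). Once a single explicit range check before the loop dominates all array accesses, the loop body is check-free straight-line code after unrolling, which is exactly the shape of isomorphic statements SLP packing looks for:

```java
public class BoundsCheckHoisting {
    // Naive form: conceptually, each a[i] and b[i] carries its own implicit
    // bounds check, and each check is a potential deopt exit that breaks up
    // the straight-line body a superword packer needs.
    static int sumNaive(int[] a, int[] b, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i] + b[i];
        }
        return s;
    }

    // After (conceptual) bounds check elimination: one explicit range check
    // before the loop dominates every access, so the unrolled body contains
    // only check-free, isomorphic statements that could be packed into
    // vector lanes. The scalar tail handles the remaining iterations.
    static int sumHoisted(int[] a, int[] b, int n) {
        if (n < 0 || n > a.length || n > b.length) {
            throw new ArrayIndexOutOfBoundsException();
        }
        int s = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) { // 4-way unrolled, isomorphic statements
            s += a[i] + b[i];
            s += a[i + 1] + b[i + 1];
            s += a[i + 2] + b[i + 2];
            s += a[i + 3] + b[i + 3];
        }
        for (; i < n; i++) { // scalar tail
            s += a[i] + b[i];
        }
        return s;
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};
        int[] b = {7, 6, 5, 4, 3, 2, 1};
        System.out.println(sumNaive(a, b, 7));   // 56
        System.out.println(sumHoisted(a, b, 7)); // 56
    }
}
```

This is of course a hand-written sketch of the transformation; in the compiler the check would be synthesised from the loop bounds rather than written by the user.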

For longer discussions on larger patches, our preference is to use either GitHub 
pull requests or GitHub issues. So feel encouraged to create those for the topic.

Regards, thomas


> On 2 Jul 2019, at 17:14, Niklas Vangerow <nvangerow at twitter.com> wrote:
> 
> Hello graal-dev,
> 
> Twitter has been using Graal in production for several years. Being a Scala
> shop, Graal has allowed us to more easily create optimizations that are
> targeted at our specific stack. If you’re curious about how and why we use
> Graal, check out Chris Thalinger’s talk from Scala Days New York 2018:
> https://www.youtube.com/watch?v=PtgKmzgIh4c. Despite Graal's success at
> Twitter, one area where it lacks feature parity with C2 is automatic
> vectorization. We anticipate that implementing it will improve performance
> for some of our internal customers.
> 
> As a result, we’ve decided to implement automatic vectorization ourselves.
> We have produced a prototype implementation that is available here
> https://github.com/nvgrw/graal/tree/feature/autovec. We’re continuously
> iterating and are open to feedback and suggestions on the design!
> 
> Our current approach is largely based on the algorithm described in the SLP
> paper (http://groups.csail.mit.edu/cag/slp/SLP-PLDI-2000.pdf), similar to
> what C2 does. Any basic block with isomorphic statements is considered for
> vectorization, and loops are partially unrolled to produce straight-line
> code with isomorphic instructions. We’re hoping to improve the current
> implementation with SLP extensions published in academia and a better
> heuristic once we have a complete implementation for the basic algorithm.
> Due to its prevalence, we are currently targeting only AMD64, but we hope
> to support all Graal backends in the future.
> 
> In addition to hearing the community’s thoughts on our general approach to
> the autovectorization problem, we have several outstanding questions about
> specifics of the current approach.
> 
> We introduced a new phase to the LowTier to perform packing, which runs
> just after the FixReadsPhase. We were wondering whether this is the best
> location for a packing phase. Since we added
> specific stamps to represent vector packs of integers and floating point
> numbers, one reason for packing so late was to avoid having to modify
> earlier phases in order to support vectors. Some phases use the
> JavaKind/JavaConstant types from JVMCI, which don’t currently support
> vector types and would probably require several workarounds to get working.
> 
> We also wish to perform partial unrolling of loops in order to expose more
> superword-level parallelism. The LoopPartialUnrollPhase seemed like a
> perfect candidate, but runs after a LoweringPhase that introduces several
> new blocks to loop bodies. The unroll phase is limited to unrolling loops
> containing fewer than 3 blocks (
> https://mail.openjdk.java.net/pipermail/graal-dev/2019-May/005794.html),
> which means that loops containing any form of array access do not get
> unrolled at all due to guard lowering. We initially decided to tackle the
> problem by adding another loop phase before guards get lowered, which,
> unsurprisingly, results in excessive bounds checking. If you have any ideas
> for tackling this issue please let us know!
> 
> Regards,
> 
> @nvgrw and the rest of Twitter’s VM team
