RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Emanuel Peter epeter at openjdk.org
Fri Jun 23 09:14:07 UTC 2023


On Wed, 21 Jun 2023 08:25:03 GMT, Pengfei Li <pli at openjdk.org> wrote:

>> ## TL;DR
>> 
>> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, the new one adds a standalone phase to C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details can be found in the document posted in a reply to this pull request.
>
> ## Background & Problems
> 
> Post loop vectorization takes advantage of the vector mask (predicate) features of some hardware platforms, such as x86 AVX-512 and AArch64 SVE, to vectorize the tail iterations of loops for better performance. The existing implementation in the C2 compiler has a long history. It was first implemented in [JDK-8153998](https://bugs.openjdk.org/browse/JDK-8153998) in 2016 under C2's experimental feature PostLoopMultiversioning to support x86 AVX-512 vector masks. Due to insufficient maintenance, it had been broken for a very long time. Last year, we took over [JDK-8183390](https://bugs.openjdk.org/browse/JDK-8183390) to fix and re-enable this feature. Several issues were fixed and AArch64 vector mask support was added at that time. As we proposed to make post loop vectorization non-experimental in future JDK releases, we ran some stress tests early this year and found more problems. They fall into three areas: stability, maintainability and performance.
> 
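To make the idea of a masked tail concrete, here is a minimal sketch in plain Java that emulates it lane by lane. It is not how C2 implements anything: the class name, the `LANES` width and the `boolean[]` mask are assumptions chosen for illustration, standing in for a hardware predicate register such as those in AVX-512 or SVE.

```java
// Illustrative emulation only; names and the lane count are hypothetical.
final class MaskedTailSketch {
    // Assumed vector width in int lanes; real hardware widths differ.
    static final int LANES = 8;

    // Computes c[i] = a[i] + b[i] for i in [0, n).
    static void add(int[] a, int[] b, int[] c, int n) {
        int i = 0;
        // Main loop: only full vectors, emulated one lane at a time.
        for (; i + LANES <= n; i += LANES) {
            for (int lane = 0; lane < LANES; lane++) {
                c[i + lane] = a[i + lane] + b[i + lane];
            }
        }
        // Tail: a single masked "vector" operation replaces the scalar post loop.
        if (i < n) {
            boolean[] mask = new boolean[LANES];
            for (int lane = 0; lane < LANES; lane++) {
                mask[lane] = i + lane < n; // lane is active only while in range
            }
            for (int lane = 0; lane < LANES; lane++) {
                if (mask[lane]) { // inactive lanes neither load nor store
                    c[i + lane] = a[i + lane] + b[i + lane];
                }
            }
        }
    }
}
```

With `n = 11` and eight lanes, the main loop covers elements 0 to 7 and the masked step covers elements 8 to 10 with three active lanes, so no scalar iterations remain.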
> 1. Stability
> Multiple C2 crashes and miscompilations related to post loop vectorization were filed on JBS, including [JDK-8301657](https://bugs.openjdk.org/browse/JDK-8301657), [JDK-8301904](https://bugs.openjdk.org/browse/JDK-8301904), [JDK-8301944](https://bugs.openjdk.org/browse/JDK-8301944), [JDK-8304774](https://bugs.openjdk.org/browse/JDK-8304774), [JDK-8308949](https://bugs.openjdk.org/browse/JDK-8308949) and perhaps more with recent C2 patches.
> 
> 2. Maintainability
> The original implementation is based on multi-versioned post loops, and its code is mixed into SuperWord. But post loop vectorization does not actually use the SLP algorithm, so the current SuperWord code contains a lot of special handling for post loops. As more and more features are added to SuperWord, the legacy code is becoming increasingly difficult to maintain and extend.
> 
> 3. Performance
> Post loop vectorization was expected to bring an obvious performance benefit for loops with few iterations, but JMH tests showed that it didn't. One main reason is that the main loop's minimum-trip guard jumps over the multi-versioned vector post loop if the whole loop has very few iterations (read [JDK-8307084](https://bugs.openjdk.org/browse/JDK-8307084) to learn more). The previous implementation also has limited vectorization ability; for example, it can only vectorize loop statements with a single data size.
> 
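The performance point about the minimum-trip guard can be pictured with a small control-flow model. This is a simplified sketch under stated assumptions, not C2's actual loop structure: `LANES`, `MIN_TRIP` and the method name are hypothetical, and pre-loops and unrolling are ignored. It only counts how many elements each loop would handle when the guard skips both the main loop and the vector post loop.

```java
// Simplified model of the old loop layout; all names and thresholds are hypothetical.
final class MinTripGuardSketch {
    static final int LANES = 8;            // assumed vector width in elements
    static final int MIN_TRIP = 2 * LANES; // assumed minimum-trip threshold

    // Returns element counts as {main loop, vector post loop, scalar post loop}.
    static int[] oldLayout(int n) {
        int main = 0;
        int vectorPost = 0;
        int scalarPost = 0;
        if (n >= MIN_TRIP) {
            main = (n / LANES) * LANES; // full vectors in the main loop
            vectorPost = n - main;      // remainder handled under a mask
        } else {
            // The guard jumps past the main loop and the vector post loop,
            // so a short loop runs entirely in scalar code.
            scalarPost = n;
        }
        return new int[] {main, vectorPost, scalarPost};
    }
}
```

In this model a 5-element loop never reaches the vector post loop, which is exactly the small-trip-count case that post loop vectorization was meant to speed up.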
> ## About this patch
> 
> The main idea of post loop vectorization is widening scalar operations in the post loop and adding vector mask...

@pfustc Thanks already for the PR description and graphs! I'm going to look at this today and give you some preliminary feedback. At first glance it looks quite good. I'm especially happy that you moved things outside of SuperWord 😊

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14581#issuecomment-1603978076

