[Heads-up] JDK-8308994: C2: Re-implement experimental post loop vectorization
Pengfei Li
Pengfei.Li at arm.com
Mon May 29 03:12:52 UTC 2023
Hi,
I'm writing to let you know that I just filed "JDK-8308994: C2: Re-implement experimental post loop vectorization".
[BACKGROUND]
Post loop vectorization in the C2 compiler has a long history. It was first implemented in JDK-8153998 in 2016 as an experimental feature to support x86 AVX-512 vector masks. Due to insufficient maintenance, it was broken for a very long time. Last year, I took over JDK-8183390 to fix and re-enable this feature. Several issues were fixed, and AArch64 SVE vector mask support was added in the meantime. We (Arm) proposed making post loop vectorization non-experimental in a future JDK release, so early this year (2023) we tested it extensively and found more problems.
[PROBLEMS]
Problems include stability, maintainability and performance.
1) Stability issues
Multiple C2 crashes and miscompilations were filed on JBS, including JDK-8301657, JDK-8301904, JDK-8301944, JDK-8304774, JDK-8308949 and perhaps more.
2) Maintainability issue
The original implementation is based on multi-versioned post loops, and its logic is mixed into SuperWord. But the algorithm for post loop vectorization is actually *not* SLP. As more and more features have been added to SuperWord, the legacy code for post loop vectorization has become increasingly difficult to maintain.
3) Performance issue
Post loop vectorization was expected to improve performance for vectorizable loops with small iteration counts, but JMH tests show that it does not. A main reason is that the vector masked post loop is skipped (not executed) when the loop trip count is small, due to the zero-trip guard of the main loop. That is a major defect of the current multi-versioning framework. (See JDK-8307084 for more details.)
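
To illustrate the zero-trip guard problem described above, here is a minimal Java model of the loop shape. It is only a sketch, not C2 code: the class name, PRE_ITERS and MAIN_STRIDE are hypothetical, and the placement of the masked post loop behind the main loop's guard is an assumption taken from the description above rather than from the actual ideal graph.

```java
import java.util.ArrayList;
import java.util.List;

public class LoopShapeModel {
    // Hypothetical values, chosen only for illustration.
    static final int PRE_ITERS = 1;    // iterations handled by the pre loop (assumed)
    static final int MAIN_STRIDE = 16; // elements consumed per main loop iteration (assumed)

    /** Returns the names of the loops that run at least one iteration. */
    static List<String> loopsExecuted(int tripCount) {
        List<String> executed = new ArrayList<>();
        int i = 0;

        // Pre loop: runs a few scalar iterations first.
        int preLimit = Math.min(PRE_ITERS, tripCount);
        if (i < preLimit) {
            executed.add("pre");
            i = preLimit;
        }

        // Zero-trip guard of the main loop: entered only if one full
        // main loop iteration fits into the remaining trip count.
        if (i + MAIN_STRIDE <= tripCount) {
            executed.add("main");
            while (i + MAIN_STRIDE <= tripCount) {
                i += MAIN_STRIDE;
            }
            // Assumption: the vector masked post loop is reachable only
            // from inside the guard, so it cannot run when the guard fails.
            if (i < tripCount) {
                executed.add("masked-post");
                i = tripCount;
            }
        }

        // Scalar post loop: handles whatever is left.
        if (i < tripCount) {
            executed.add("scalar-post");
            i = tripCount;
        }
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(loopsExecuted(8));  // small trip count
        System.out.println(loopsExecuted(40)); // larger trip count
    }
}
```

In this model a trip count of 8 never reaches the masked post loop and falls through to the scalar post loop, which is exactly the small-iteration case the feature was meant to speed up.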
[ACTIONS]
For better stability, maintainability and performance, we now propose to deprecate the current multi-versioning framework and completely re-implement the experimental post loop vectorization, for both x86 AVX-512 and AArch64 SVE. Our new proposal is to add a standalone ideal loop phase (outside SuperWord) that performs the vector mask transformation directly on the original scalar post loop.
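
As a rough picture of what a vector mask transformation of a scalar post loop means, the following Java sketch emulates a masked loop in plain scalar code. It is a conceptual illustration under stated assumptions, not the proposed C2 phase: VLEN and all method names are hypothetical, and the boolean array stands in for a hardware predicate such as an AVX-512 or SVE mask.

```java
import java.util.Arrays;

public class MaskedPostLoopSketch {
    static final int VLEN = 8; // hypothetical number of lanes per vector

    // The scalar post loop as it would appear before any transformation.
    static void scalarPostLoop(int[] a, int[] b, int[] c, int start, int end) {
        for (int i = start; i < end; i++) {
            c[i] = a[i] + b[i];
        }
    }

    // Emulated masked form: each iteration covers VLEN lanes, and a mask
    // disables the lanes that would run past the loop limit.
    static void maskedPostLoop(int[] a, int[] b, int[] c, int start, int end) {
        for (int i = start; i < end; i += VLEN) {
            boolean[] mask = new boolean[VLEN];
            for (int lane = 0; lane < VLEN; lane++) {
                mask[lane] = i + lane < end; // lane is active only inside the range
            }
            for (int lane = 0; lane < VLEN; lane++) {
                if (mask[lane]) { // inactive lanes neither load nor store
                    c[i + lane] = a[i + lane] + b[i + lane];
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] a = new int[13];
        int[] b = new int[13];
        for (int i = 0; i < a.length; i++) {
            a[i] = i;
            b[i] = 10 * i;
        }
        int[] expected = new int[13];
        int[] actual = new int[13];
        scalarPostLoop(a, b, expected, 3, 13);
        maskedPostLoop(a, b, actual, 3, 13);
        System.out.println(Arrays.equals(expected, actual));
    }
}
```

The point of the sketch is that the masked form needs no scalar tail: the last, partially filled vector iteration is handled by the mask instead of a separate loop.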
We have been working on this internally for a while and have finished a draft patch. I will push the patch for review soon, once it passes all tests and is polished enough.
--
Thanks,
Pengfei