RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Pengfei Li pli at openjdk.org
Sun Jun 25 09:32:09 UTC 2023


On Fri, 23 Jun 2023 15:15:02 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> ## TL;DR
>> 
>> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, this new implementation adds a standalone loop phase in C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details about this patch can be found in the document replied in this pull request.
>
> I'm in the middle of reviewing, but have to end it here for the week now 😊 
> 
> For now it's a lot of detail-feedback. I'll give a more overall-feedback once I'm done reading through, and reflecting on it.
> 
> Still: this is good work. We will have to discuss the performance benefits vs the code complexity. And maybe we first need to refactor some things to reduce code duplication. But this looks much better than the previous post-loop vectorization.
> 
> Have a great weekend,
> Emanuel

Hi @eme64,

Thank you so much for so many detailed suggestions. As it may take time to address all your comments, I would like say something about our general thoughts first corresponding to your preliminary feedback.

- Regarding duplicated code, it is actually our biggest concern while refactoring the post loop vectorization out from SuperWord. That's also the reason we choose to reuse `SWPointer`. Your suggestion of moving `SWPointer` out is good. But for others, we cannot yet conclude that we should combine similar logics in `SuperWord` and `VectorMaskedLoop` at the moment. Current code does look duplicated as I referred SuperWord code while working on this patch. But their logics are not exactly the same and may diverge as we extend them in the future. For example, `VectorMaskedLoop` is supposed to be able to vectorize loops with some "if-else" conditions (SVE has the ability). If we try to add that support, we may need to change current RPO traversal in `VectorMaskedLoop` because the similar logic in SuperWord only supports single basic block loop. For the same reason, we may need to change the way of finding vector element types if we add vectorization support in `VectorMaskedLoop` for type 
 conversions. Considering doing such code combination is involved, I hope to leave some duplication for now and decide whether to combine them later.

- Regarding the "experimental", we agree that keeping a feature experimental for a long time is bad. Of course, we don't want this re-implementation to be abandoned several years later just like previous `PostLoopMultiversioning` due to lack of test. Feeling unsafe is not the only reason we want to keep it experimental. Today, there are various CPU hardwares that support vector masks (predicates) and their performance on masked vector operations are not consistent. The performance data we showed above are just from the latest generations of x86 and AArch64 CPUs. We are also testing on other hardwares and seeing different results. In the end, we may only enable this feature for some micro-architectures of CPUs where it's beneficial. So in the short term, we propose to keep it experimental and turned off by default. In addition, only a small portion of today's CPUs in the world can get performance benefit from vector masks. So we don't need to rush to make this non-experimental. I thin
 k one or two JDK release is a reasonable time period for such "experimental".

- Regarding the performance, I apologize that I cannot answer all your questions at the moment because we just started the performance evaluation work. But we will get you back later once we get more conclusions.

- Regarding the testing, adding more IR rules is on the way. But as you might have realized, this depends on some other on-going JBS tasks. And how to add new rules depends on the result of our discussions about syncing flags and cpu features in another PR. The original jtreg `TestRangeCheckEliminationDisabled.java` is too simple and just checks the compatibility of two VM options. So I suspect it is no longer necessary after `PostLoopMultiversioning` is removed.

Thanks again for all your feedback. I will address the detailed comments one by one later.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/14581#issuecomment-1605988026


More information about the hotspot-compiler-dev mailing list