RFR: 8308994: C2: Re-implement experimental post loop vectorization
Pengfei Li
pli at openjdk.org
Tue Jul 4 08:50:14 UTC 2023
On Thu, 29 Jun 2023 10:54:29 GMT, Emanuel Peter <epeter at openjdk.org> wrote:
>> Hi @eme64,
>>
>> I guess you have done your first round of review. @fg1417 and I really appreciate all your constructive input. Reading your comments, I can tell you have reviewed this patch in great detail. Thanks again!
>>
>> What I am doing now:
>>
>> - I'm trying to fix the issues which I think can be fixed immediately.
>> - I'm trying to answer all your simple questions ASAP.
>>
>> Regarding your request for the big refactoring work, I feel I personally may not have enough time and capability to complete it in a short time. We may need some discussion about it. But it's great to learn more about your "hybrid vectorizer" plan from your feedback. It looks like a grand plan that requires significant effort and cooperation. I strongly agree that we need a conversation to discuss where we should move forward and how we can cooperate. Could you give us a moment to digest your idea before we schedule a conversation?
>>
>> BTW: What's your preferred time for a conversation? We are based in Shanghai (GMT+8)
>
> Hi @pfustc !
>
> I'm glad you appreciate my review.
>
>> For your request of big refactoring work, I feel like I personally may not have enough time and capability to complete it in a short time.
>
> Are you under some time constraint? No pressure from my side, take the time you need.
>
> I would very much love to have a conversation over a video call with you. I think that would be beneficial for all of us. The problem on our side (Oracle) is intellectual property concerns. OpenJDK emails and PRs are all under the Oracle Contributor Agreement, so there I'm free to have conversations. I'm trying to figure out if we can have a similar framework for a video call; sadly it may take a few weeks or months to get that sorted, as many people are on summer vacation.
>
> Please take some time to digest the feedback. This is a big change set, it will take a while to be ready for integration at any rate. And again, I would really urge you to consider some refactoring of SuperWord in a separate RFE before this change here.
>
> I'm looking forward to more collaboration - over PR comments, emails, and hopefully eventually video calls as well 😃
> Emanuel
Hi @eme64,
In commit 2, I have fixed all the simple issues according to your comments and marked them "resolved". We can then spend more time on the remaining unresolved issues. Now I'd like to answer more questions from your overall feedback.
> You could not mask all instructions, just loads and stores. But do you really need to mask all other instructions too? I guess not if they do not have side-effects, right? Adding avx/avx2 would unlock this feature for many more intel machines.
Besides loads and stores, vector reductions also need to be masked because they do have side effects (only active lanes should be involved in reduction operations). But yes, reduction support is already excluded from this patch because of performance. Perhaps in the future we can consider transforming reductions into non-reductions (just like what you did recently in SuperWord) to get better performance. In commit 2, I have updated this code according to your suggestion. Thanks for it! Regarding avx/avx2 support, I'm afraid we don't have enough knowledge of, or test resources for, x86. We may need Intel's help if we want to do this.
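For illustration only, here is a scalar Java sketch (not C2 code; the class and method names are mine) of why a reduction must be predicated: if tail lanes past the trip count contributed, the result would be wrong.

```java
public class MaskedReductionSketch {
    // Scalar model of a masked vector sum reduction: a lane may
    // contribute to the accumulator only if its mask bit is set.
    static int maskedReduceAdd(int[] lanes, boolean[] mask, int acc) {
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i]) {        // inactive lanes are skipped entirely
                acc += lanes[i];
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = {1, 2, 3, 4};
        boolean[] mask = {true, true, false, false}; // 2 active lanes
        System.out.println(maskedReduceAdd(lanes, mask, 0)); // prints 3
    }
}
```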
> Indexing arrays: From my experiments, I have to conclude that you only allow simple indexing of the form a[i], no offset or scaling a[i*2 + 3].
Scaling support is excluded from this patch because scaled accesses are actually "strided accesses" - we cannot get better performance for them with C2's current gather/scatter nodes. But offset support is in (except for the special case of `iv + stride`, which you experimented with). You may try other cases like `a[i + 2]` or `a[i - 3]` and see that they are vectorized.
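To make the loop shapes concrete (method names and bounds are just illustrative examples, not test cases from the patch): the offset form below is contiguous and eligible here, while the scaled form is a strided access and excluded.

```java
public class IndexForms {
    // Offset indexing: contiguous accesses, the kind eligible for
    // masked vector loads/stores in the post loop.
    static void offset(int[] a, int[] b) {
        for (int i = 0; i < a.length - 2; i++) {
            a[i] = b[i + 2];
        }
    }

    // Scaled indexing: a strided access pattern, which would need
    // gather/scatter support and is therefore excluded.
    static void scaled(int[] a, int[] b) {
        for (int i = 0; i < a.length / 2 - 2; i++) {
            a[i] = b[i * 2 + 3];
        }
    }

    public static void main(String[] args) {
        int[] a = new int[6];
        offset(a, new int[]{0, 1, 2, 3, 4, 5});
        System.out.println(java.util.Arrays.toString(a)); // [2, 3, 4, 5, 0, 0]
    }
}
```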
> Why not use this for main-loops?
As I mentioned above, we have experimented with more than what we do in this patch, including reductions, strided accesses, type conversions and normal (unsplit) loops. Indeed, at the beginning, we used this for normal loops and did vectorization before C2's loop iteration split - this is the ideal SVE-style vectorization on AArch64, as the generated code is very elegant. But unfortunately, the performance result was not as good as we expected, and it added a lot of complexity because we needed to make it compatible with C2's loop strip mining. Later, we turned to using this for post loops. It does reduce a lot of complexity and shows better performance (at least on all SVE CPUs we have). Using this for main loops is still in our long-term plan, but not in the short term, because the current SuperWord can handle them well.
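A scalar Java model of the loop split being discussed (the class name, `VLEN`, and the inner lane loops are hypothetical stand-ins for what the predicated SVE code does, not the actual IR): the main loop advances a whole vector per iteration, and the post loop finishes the tail in a single masked pass.

```java
public class PostLoopSketch {
    static final int VLEN = 4; // hypothetical vector length in lanes

    // Scalar model of the loop split: an unmasked main loop advancing
    // VLEN elements per iteration, then one predicated post-loop pass
    // covering the remaining tail lanes.
    static void addOne(int[] a) {
        int i = 0;
        // main loop: whole vectors, no masking needed
        for (; i + VLEN <= a.length; i += VLEN) {
            for (int lane = 0; lane < VLEN; lane++) {
                a[i + lane] += 1;
            }
        }
        // post loop: one masked pass; mask = (lane < remaining)
        int remaining = a.length - i;
        for (int lane = 0; lane < VLEN; lane++) {
            if (lane < remaining) {  // only active lanes touch memory
                a[i + lane] += 1;
            }
        }
    }

    public static void main(String[] args) {
        int[] a = new int[7]; // 1 main-loop iteration + 3 tail lanes
        addOne(a);
        System.out.println(java.util.Arrays.toString(a)); // all ones
    }
}
```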
> What about a hybrid vectorizer?
While working on this patch, we were also thinking about how to make it co-exist with SuperWord. That's a big question! Unfortunately, the current SuperWord code is quite convoluted, with a heavy historical burden, and very few people in the JDK community were interested in it before. For a long time, we have been hoping someone would refactor it, but before you, nobody seemed to have the interest or ambition to do it. We are now very glad to know that you have such a "hybrid" plan. But I think how to refactor the current SuperWord code also depends on what the "hybrid vectorizer" eventually looks like. We may need more discussions in the future about the direction of refactoring. Looking forward to having a video call with you.
Thanks,
Pengfei
-------------
PR Comment: https://git.openjdk.org/jdk/pull/14581#issuecomment-1619814602
More information about the hotspot-compiler-dev
mailing list