RFR: 8308994: C2: Re-implement experimental post loop vectorization [v2]

Tue Jul 4 02:40:15 UTC 2023

On Wed, 28 Jun 2023 10:41:12 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> Pengfei Li has updated the pull request incrementally with one additional commit since the last revision:
>> 
>>   Address part of comments from Emanuel
>
> src/hotspot/share/opto/vmaskloop.cpp line 89:
> 
>> 87:   cl->mark_loop_vectorized();
>> 88:   cl->mark_vector_masked();
>> 89:   _phase->C->set_max_vector_size(MaxVectorSize);
> 
> What is this for?

On AArch64 with SVE, we pre-initialize an all-true register (p7) only in compiled methods where `MaxVectorSize` is set. So if a compiled method implicitly uses ptrue (because it's vectorized), we need to set the `MaxVectorSize` to guarantee p7 is initialized. This usage starts since the initial SVE patch (in 2020). We know this is not a perfect solution and @fg1417 is currently investigating if we have better solutions.

> src/hotspot/share/opto/vmaskloop.cpp line 531:
> 
>> 529:   if (!addp->is_AddP() || !operates_on_array_of_type(addp, mem_type)) {
>> 530:     return nullptr;
>> 531:   }
> 
> I guess this prevents you from having `Unsafe` use type mismatched loads/stores. But it also prevents vectorization in cases where one might just store shorts into an int array using `Unsafe`. This saves you a lot of headaches. You probably don't lose too much for not vectorizing those cases.

Exactly. Is there any case in real applications that may store shorts into an int array?

> src/hotspot/share/opto/vmaskloop.cpp line 642:
> 
>> 640: 
>> 641: // Helper method for finding or creating a vector input at specified index
>> 642: Node* VectorMaskedLoop::get_vector_input(Node* node, uint idx) {
> 
> This is analogous to `SuperWord::vector_opd`. Can we not refactor things so that we can share the code?

I like refactoring, but it may require big effort. Shall we discuss it later in our future conversations?

> src/hotspot/share/opto/vmaskloop.cpp line 939:
> 
>> 937:   Node* root_vmask = vmask_tree->at(1);
>> 938: 
>> 939:   // Replace vectorization candidate nodes to vector nodes
> 
> Expand explanation. Say that you are for now only generating a single vector node per scalar node. And that the duplication afterwards makes sure that all scalar nodes are "widened" to the same number of elements. The smalles type using a single vector, larger types using multiple (duplicated) vectors per scalar node.

Thanks for suggestion. Done in commit 2.

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1251410392
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1251410844
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1251411540
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1251411689