RFR: JDK-8308994: C2: Re-implement experimental post loop vectorization

Tue Jun 27 17:52:18 UTC 2023

On Fri, 23 Jun 2023 14:44:15 GMT, Emanuel Peter <epeter at openjdk.org> wrote:

>> ## TL;DR
>> 
>> This patch completely re-implements C2's experimental post loop vectorization for better stability, maintainability and performance. Compared with the original implementation, this new implementation adds a standalone loop phase in C2's ideal loop phases and can vectorize more post loops. The original implementation and all code related to multi-versioned post loops are deleted in this patch. More details about this patch can be found in the document posted as a reply in this pull request.
>
> src/hotspot/share/opto/vmaskloop.cpp line 550:
> 
>> 548:   //  2) Address is growing down (index scale * loop stride < 0)
>> 549:   //  3) Memory access scale is different from data size
>> 550:   //  4) The loop increment node is on the SWPointer's node stack
> 
> Why should the `incr` not be on the node stack?

Does that not prevent `a[i+1]` from being accepted?

> src/hotspot/share/opto/vmaskloop.cpp line 595:
> 
>> 593:   uint tree_depth = exact_log2(large) - exact_log2(small) + 1;
>> 594:   // All vector masks construct a perfect binary tree of "2 ^ depth - 1" nodes
>> 595:   // We create a list of "2 ^ depth" nodes for easier computation.
> 
> Assume we have a small and a large type (byte and long). Size 1 and 8. `tree_depth = log2(8) - log2(1) + 1 = 3 - 0 + 1 = 4`. Then you generate a tree with `2^4-1 = 15` nodes. Did I calculate this right? That seems a bit excessive. Would be interesting to see benchmarks for mixed type cases.

Can there be cases where creating the masks makes vectorization unprofitable?
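
To make the arithmetic in the quoted comment easy to check, here is a minimal sketch of the depth and node-count computation. The helper names (`log2_pow2`, `mask_tree_depth`, `mask_tree_nodes`) are hypothetical and only mirror the quoted expression; `log2_pow2` stands in for HotSpot's `exact_log2` and is not the actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for exact_log2: log2 of a power-of-two value.
inline unsigned log2_pow2(std::uint32_t x) {
  assert(x != 0 && (x & (x - 1)) == 0);  // must be a power of two
  unsigned n = 0;
  while (x > 1) {
    x >>= 1;
    ++n;
  }
  return n;
}

// Depth of the mask tree for the smallest and largest element sizes (bytes),
// following the quoted line: exact_log2(large) - exact_log2(small) + 1.
inline unsigned mask_tree_depth(std::uint32_t small, std::uint32_t large) {
  assert(small <= large);
  return log2_pow2(large) - log2_pow2(small) + 1;
}

// Node count of a perfect binary tree of the given depth: 2^depth - 1.
inline unsigned mask_tree_nodes(unsigned depth) {
  return (1u << depth) - 1;
}
```

With `small = 1` (byte) and `large = 8` (long), `mask_tree_depth(1, 8)` is 4 and `mask_tree_nodes(4)` is 15, matching the numbers above. A loop with a single element size has depth 1 and a single mask node.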

> src/hotspot/share/opto/vmaskloop.cpp line 785:
> 
>> 783: }
>> 784: 
>> 785: // Duplicate vectorized operations with given vector element size
> 
> Got to here today. There should probably be some comment higher up that you first replace scalars with one vector each, and then duplicate them for the larger types that need multiple vectors.
> 
> I'm also concerned that there may be some platforms where the max vector width in bytes is not the same for all types. But maybe every platform that supports masked register ops also has the same vector width in bytes for all types?

Assume we only allow `32`-byte registers for `int`, but `64`-byte registers for `double`. Now you'd be assuming that there need to be twice as many `double` vectors as `int` vectors. But actually, they need the same number of vectors, because vectors of both sizes hold exactly `8` elements.
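
A rough sketch of that concern, reading the two widths as bytes (32-byte vectors for `int`, 64-byte vectors for `double`), since that is the reading under which both hold 8 elements. `TypeInfo`, `lanes_per_vector` and `vectors_needed` are hypothetical names for illustration; the widths are made-up values, not taken from any real platform or from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one element type.
struct TypeInfo {
  std::uint32_t element_bytes;     // size of one element
  std::uint32_t max_vector_bytes;  // widest vector allowed for this type
};

// Number of elements that fit into one vector of this type.
inline std::uint32_t lanes_per_vector(TypeInfo t) {
  assert(t.element_bytes > 0 && t.max_vector_bytes % t.element_bytes == 0);
  return t.max_vector_bytes / t.element_bytes;
}

// Number of vectors needed to cover `lanes` elements, rounding up.
inline std::uint32_t vectors_needed(TypeInfo t, std::uint32_t lanes) {
  std::uint32_t per_vec = lanes_per_vector(t);
  return (lanes + per_vec - 1) / per_vec;
}
```

With `TypeInfo{4, 32}` for `int` and `TypeInfo{8, 64}` for `double`, both types hold 8 lanes per vector and need one vector for 8 elements. Only when the byte width is the same for both, e.g. `TypeInfo{8, 32}` for `double`, does the larger type need twice as many vectors, which is what deriving the count from element size alone would assume.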

-------------

PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244068613
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244093283
PR Review Comment: https://git.openjdk.org/jdk/pull/14581#discussion_r1244130010