RFR: 8346664: C2: Optimize mask check with constant offset [v7]

Thu Jan 30 08:26:54 UTC 2025

On Wed, 29 Jan 2025 17:23:40 GMT, Matthias Ernst <duke at openjdk.org> wrote:

>> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>> 
>> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`:
>> 
>> 
>> (base + (index + 1) << 8) & 255
>> => MulNode
>> (base + (index << 8 + 256)) & 255
>> => AddNode
>> ((base + index << 8) + 256) & 255
>> 
>> 
>> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction:
>> 
>> 
>> ((base + index << 8) + 256) & 255
>> => MulNode (this PR)
>> (base + index << 8) & 255
>> => MulNode (PR #6697)
>> base & 255 (loop invariant)
>> 
>> 
>> Implementation notes:
>> * I verified that the originating issue "scaled varhandle indexed with i+1"  (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR.
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~
>> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if you would like to see separate cases for these.~
>
> Matthias Ernst has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision:
> 
>  - Merge branch 'openjdk:master' into mernst/JDK-8346664
>  - make the check more clear: shift >= mask_width
>  - fully randomized
>  - JLS: only the lower bits of the shift are taken into account (aka we don't assert).
>  - (c)
>  - (c)
>  - Assert that MulNode::Ideal already masks constant shift amounts for us.
>    Avoid accidental zero mask breaking test.
>  - "element".
>  - avoid redundant comment
>  - addConstNonConstMaskLong
>  - ... and 10 more: https://git.openjdk.org/jdk/compare/68db7e5c...490cc2fb
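
For reference, the reduction quoted above rests on a plain bit identity: adding a value whose low 8 bits are all zero cannot change the low 8 bits of a sum. Below is a minimal Java sketch of that identity. The class and method names are hypothetical, and it only checks the arithmetic, not what C2 actually does with the graph.

```java
import java.util.Random;

// Hypothetical illustration, not code from the PR.
public class MaskedSumIdentity {
    // The shape that arises from a scaled access indexed with index + 1.
    public static long original(long base, long index) {
        return (base + ((index + 1) << 8)) & 255;
    }

    // The shape after the shifted term has been dropped under the mask.
    public static long reduced(long base) {
        return base & 255;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        for (int n = 0; n < 1_000_000; n++) {
            long base = random.nextLong();
            long index = random.nextLong();
            // Overflow wraps in Java, and wrapping does not affect the low 8 bits.
            if (original(base, index) != reduced(base)) {
                throw new AssertionError("mismatch for base=" + base + ", index=" + index);
            }
        }
        System.out.println("identity holds for all sampled values");
    }
}
```

Because the reduced form no longer depends on `index`, it is loop invariant, which is what allows the alignment check to be hoisted.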

I'm already seeing a list of failures.

----------------------
`compiler/vectorization/TestPopulateIndex.java`
VM Flags: `-XX:UseAVX=3`
Not sure if that reproduces on a non-AVX512 machine.

Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public void compiler.vectorization.TestPopulateIndex.exprWithIndex1()" - [Failed IR rules: 1]:
   * @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#POPULATE_INDEX#_", "> 0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
     > Phase "PrintIdeal":
       - counts: Graph contains wrong number of nodes:
         * Constraint 1: "(\\d+(\\s){2}(PopulateIndex.*)+(\\s){2}===.*)"
           - Failed comparison: [found] 0 > 0 [given]
           - No nodes matched!

This is the method:

    @Test
    @IR(counts = {IRNode.POPULATE_INDEX, "> 0"})
    public void exprWithIndex1() {
        for (int i = 0; i < count; i++) {
            dst[i] = src[i] * (i & 7);
        }
        checkResultExprWithIndex1();
    }

I suspect the issue is that `(i & 7)` now folds for some unrolled lanes but not for others, so `SuperWord` can no longer match all lanes uniformly. As `i` gets unrolled to `i+0`, `i+1`, ..., `i+7`, `i+8`, ..., the expression `(i + 8) & 7` becomes `i & 7`. That destroys the pattern we match in `superword.cpp`:

    // Look for pattern n1 = (iv + c) and n2 = (iv + c + 1), which may lead to
    // PopulateIndex vector node. We skip the pack creation of these nodes. They
    // will be vectorized by SuperWordVTransformBuilder::get_or_make_vtnode_vector_input_at_index.
    bool SuperWord::is_populate_index(const Node* n1, const Node* n2) const {
      return n1->is_Add() &&
             n2->is_Add() &&
             n1->in(1) == iv() &&
             n2->in(1) == iv() &&
             n1->in(2)->is_Con() &&
             n2->in(2)->is_Con() &&
             n2->in(2)->get_int() - n1->in(2)->get_int() == 1;
    }
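
To make the suspected effect concrete, here is a small Java model of the unrolled lanes. It is only a sketch under an assumption: that the new fold drops the constant `c` from `(iv + c) & mask` whenever `(c & mask) == 0`. The class and method names are hypothetical and do not correspond to anything in C2.

```java
// Hypothetical illustration of how lanes lose their (iv + c) shape.
public class UnrolledMaskFolding {
    // Assumed fold condition: the constant contributes no bits under the mask.
    public static boolean addIsFoldedAway(int c, int mask) {
        return c != 0 && (c & mask) == 0;
    }

    // Textual shape of one unrolled lane after folding.
    public static String laneShape(int c, int mask) {
        if (c == 0 || addIsFoldedAway(c, mask)) {
            return "iv & " + mask;
        }
        return "(iv + " + c + ") & " + mask;
    }

    public static void main(String[] args) {
        int mask = 7;
        for (int c = 0; c < 16; c++) {
            // Check that dropping the constant preserves the value for a range of iv.
            for (int iv = -64; iv < 64; iv++) {
                if (addIsFoldedAway(c, mask) && ((iv + c) & mask) != (iv & mask)) {
                    throw new AssertionError("fold changes the value for c=" + c);
                }
            }
            System.out.println("lane " + c + ": " + laneShape(c, mask));
        }
    }
}
```

In this model the lane with `c == 8` no longer contains an add of a constant, so the pair `(iv + 7)`, `(iv + 8)` stops looking like two adds whose constants differ by 1, which is the shape `is_populate_index` checks for.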

**The Dilemma**
We would like to fold away the **mask check** for `MemorySegment` alignment checks before loop-opts, so that the CFG is removed before SuperWord / Auto Vectorization. But doing so can destroy **PopulateIndex** patterns, and for those we would prefer to delay the mask folding until after SuperWord.

The question: how important is the vectorization of this example? I don't know. Maybe we can make the `is_populate_index` check and other related code smarter, but that would add the complexity of extra special cases to SuperWord, and I'm not a fan of that.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2623827217