RFR: 8346664: C2: Optimize mask check with constant offset [v7]
Emanuel Peter
epeter at openjdk.org
Thu Jan 30 08:26:54 UTC 2025
On Wed, 29 Jan 2025 17:23:40 GMT, Matthias Ernst <duke at openjdk.org> wrote:
>> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>>
>> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`:
>>
>>
>> (base + (index + 1) << 8) & 255
>> => MulNode
>> (base + (index << 8 + 256)) & 255
>> => AddNode
>> ((base + index << 8) + 256) & 255
>>
>>
>> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction:
>>
>>
>> ((base + index << 8) + 256) & 255
>> => MulNode (this PR)
>> (base + index << 8) & 255
>> => MulNode (PR #6697)
>> base & 255 (loop invariant)
>>
>>
>> Implementation notes:
>> * I verified that the originating issue "scaled varhandle indexed with i+1" (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR.
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~
>> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~
>
> Matthias Ernst has updated the pull request with a new target base due to a merge or a rebase. The incremental webrev excludes the unrelated changes brought in by the merge/rebase. The pull request contains 20 additional commits since the last revision:
>
> - Merge branch 'openjdk:master' into mernst/JDK-8346664
> - make the check more clear: shift >= mask_width
> - fully randomized
> - JLS: only the lower bits of the shift are taken into account (aka we don't assert).
> - (c)
> - (c)
> - Assert that MulNode::Ideal already masks constant shift amounts for us.
> Avoid accidental zero mask breaking test.
> - "element".
> - avoid redundant comment
> - addConstNonConstMaskLong
> - ... and 10 more: https://git.openjdk.org/jdk/compare/68db7e5c...490cc2fb
I'm already seeing a list of failures.
----------------------
`compiler/vectorization/TestPopulateIndex.java`
VM Flags: `-XX:UseAVX=3`
Not sure if that reproduces on a non-AVX512 machine.
Failed IR Rules (1) of Methods (1)
----------------------------------
1) Method "public void compiler.vectorization.TestPopulateIndex.exprWithIndex1()" - [Failed IR rules: 1]:
* @IR rule 1: "@compiler.lib.ir_framework.IR(phase={DEFAULT}, applyIfPlatformAnd={}, applyIfCPUFeatureOr={}, counts={"_#POPULATE_INDEX#_", "> 0"}, failOn={}, applyIfPlatform={}, applyIfPlatformOr={}, applyIfOr={}, applyIfCPUFeatureAnd={}, applyIf={}, applyIfCPUFeature={}, applyIfAnd={}, applyIfNot={})"
> Phase "PrintIdeal":
- counts: Graph contains wrong number of nodes:
* Constraint 1: "(\\d+(\\s){2}(PopulateIndex.*)+(\\s){2}===.*)"
- Failed comparison: [found] 0 > 0 [given]
- No nodes matched!
This is the method:
@Test
@IR(counts = {IRNode.POPULATE_INDEX, "> 0"})
public void exprWithIndex1() {
for (int i = 0; i < count; i++) {
dst[i] = src[i] * (i & 7);
}
checkResultExprWithIndex1();
}
I suspect the issue is that the `(i & 7)` constant folds for some cases but not others, and then `SuperWord` cannot match all lanes equally. As `i` gets unrolled to `i+0`, `i+1`...`i+7`, `i+8` ... the pattern `(i + 8) & 7` becomes `i & 7`. And that destroys the nice pattern we pattern match for in `superword.cpp`:
// Look for pattern n1 = (iv + c) and n2 = (iv + c + 1), which may lead to
// PopulateIndex vector node. We skip the pack creation of these nodes. They
// will be vectorized by SuperWordVTransformBuilder::get_or_make_vtnode_vector_input_at_index.
bool SuperWord::is_populate_index(const Node* n1, const Node* n2) const {
return n1->is_Add() &&
n2->is_Add() &&
n1->in(1) == iv() &&
n2->in(1) == iv() &&
n1->in(2)->is_Con() &&
n2->in(2)->is_Con() &&
n2->in(2)->get_int() - n1->in(2)->get_int() == 1;
}
**The Dilemma**
We would like to fold away the **mask check** for `MemorySegment` alignment checks before loop-opts, so that we can have the CFG removed before SuperWord / Auto Vectorization. But we possibly destroy **PopulateIndex** patterns - for these we would prefer to delay the mask folding until after SuperWord.
The question: how important is the vectorization of this example? I don't know. Maybe we can make the `is_populate_index` check and other related code smarter, but that introduces complexity of extra special cases into SuperWord - I'm not a fan of that.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2623827217
More information about the hotspot-compiler-dev
mailing list