RFR: 8346664: C2: Optimize mask check with constant offset [v8]

Thu Jan 30 12:35:48 UTC 2025

On Thu, 30 Jan 2025 11:41:35 GMT, Matthias Ernst <duke at openjdk.org> wrote:

>> Fixes [JDK-8346664](https://bugs.openjdk.org/browse/JDK-8346664): extends the optimization of masked sums introduced in #6697 to cover constant values, which currently break the optimization.
>> 
>> Such constant values arise in an expression of the following form, for example from `MemorySegmentImpl#isAlignedForElement`:
>> 
>> 
>> (base + (index + 1) << 8) & 255
>> => MulNode
>> (base + (index << 8 + 256)) & 255
>> => AddNode
>> ((base + index << 8) + 256) & 255
>> 
>> 
>> Currently, `256` is not being recognized as a shifted value. This PR enables further reduction:
>> 
>> 
>> ((base + index << 8) + 256) & 255
>> => MulNode (this PR)
>> (base + index << 8) & 255
>> => MulNode (PR #6697)
>> base & 255 (loop invariant)
>> 
>> 
>> Implementation notes:
>> * I verified that the originating issue "scaled varhandle indexed with i+1"  (https://mail.openjdk.org/pipermail/panama-dev/2024-December/020835.html) is resolved with this PR.
>> * ~in order to stay with the flow of the current implementation, I refrained from solving general (const & mask)==0 cases, but only those where const == _ << shift.~
>> * ~I modified existing test cases adding/subtracting from the index var (which would fail with current C2). Let me know if would like to see separate cases for these.~
>
> Matthias Ernst has updated the pull request incrementally with one additional commit since the last revision:
> 
>   undo assertion and mask shift amount ourselves.
>   we cannot rely on `maskShiftAmount` having run in all cases

About the vectorization tests.

--------------
`TestEquivalentInvariants`:
Here we can adjust the `@IR` rules. You can see other IR rules in the file for examples.
`testMemorySegmentIInvarL3d` fails because it expects no vector nodes in the IR, hence the `= 0`.
There are other tests that have rules like these, i.e. with `> 0`:

    @Test
    @IR(counts = {IRNode.LOAD_VECTOR_I, "> 0",
                  IRNode.ADD_VI,        "> 0",
                  IRNode.STORE_VECTOR,  "> 0"},
        applyIfPlatform = {"64-bit", "true"},
        applyIf = {"AlignVector", "false"},
        applyIfCPUFeatureOr = {"sse4.1", "true", "asimd", "true"})

Why don't you try tweaking the IR rules until they pass on your machine.

------------------

About `compiler/vectorization/TestPopulateIndex.java`.

I think we need to consider a generalization of `PopulateIndex`. Currently, it only allows the pattern:
`i+0, i+1, i+2, i+3, ...`.

This allows us to do this in the backend, see e.g. `src/hotspot/cpu/x86/x86.ad`:

instruct VectorPopulateIndex(vec dst, rRegI src1, immI_1 src2, vec vtmp) %{
  match(Set dst (PopulateIndex src1 src2));
  effect(TEMP dst, TEMP vtmp);
  format %{ "vector_populate_index $dst $src1 $src2\t! using $vtmp as TEMP" %}
  ins_encode %{
     assert($src2$$constant == 1, "required");
     int vlen_in_bytes = Matcher::vector_length_in_bytes(this);
     int vlen_enc = vector_length_encoding(this);
     BasicType elem_bt = Matcher::vector_element_basic_type(this);
     __ vpbroadcast(elem_bt, $vtmp$$XMMRegister, $src1$$Register, vlen_enc);
     __ load_iota_indices($dst$$XMMRegister, vlen_in_bytes, elem_bt);
     __ vpadd(elem_bt, $dst$$XMMRegister, $dst$$XMMRegister, $vtmp$$XMMRegister, vlen_enc);
  %}
  ins_pipe( pipe_slow );
%}

We take the `i`, broadcast it to every vector lane, and then add the `iota_indices` to it: `i,i,i,i` + `0,1,2,3`.

We could just introduce some constants vector, and add this instead of the `iota_indices`.

That would allow us to generalize it to any `i+c0, i+c1, i+c2, ...`.

I've wanted to have vector constants for other reasons anyway.

TLDR:
- I'm ok with disabling the IR rules in `TestPopulateIndex.java` for `exprWithIndex1`.
- But create another IR test with `i | 7` instead of `i & 7`, which should still vectorize and pass with IR rules, as it does not trigger your optimization.
- We will file an RFE for the generalization of `PopulateIndex` using vector constants instead of `iota_indices`. In that RFE we can mention this failed test as one example of what should work with the generalization.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/22856#issuecomment-2624385212