RFR: 8292675: Add identity transformation for removing redundant AndV/OrV nodes

Thu Sep 8 11:41:54 UTC 2022

On Thu, 8 Sep 2022 08:19:57 GMT, Tobias Hartmann <thartmann at openjdk.org> wrote:

>> Recently we found that the rotate left/right benchmarks with vectorapi
>> emit a redundant "and" instruction on both aarch64 and x86_64 machines
>> which can be done away with.  For example - and(and(a, b), b) generates
>> two "and" instructions which can be reduced to a single "and" operation-
>> and(a, b) since "and" (and "or") operations are commutative and
>> idempotent in nature.  This can help improve performance for all those
>> workloads which have multiple "and"/"or" operations with the same value
>> by reducing them to fewer "and"/"or" operations accordingly.
>> 
>> This patch adds the following transformations for vector logical
>> operations - AndV and OrV :
>> 
>> 
>> (OpV (OpV a b) b) => (OpV a b)
>> (OpV (OpV a b) a) => (OpV a b)
>> (OpV (OpV a b m1) b m1) => (OpV a b m1)
>> (OpV (OpV a b m1) a m1) => (OpV a b m1)
>> (OpV a (OpV a b)) => (OpV a b)
>> (OpV b (OpV a b)) => (OpV a b)
>> (OpV a (OpV a b m) m) => (OpV a b m)
>> 
>> where Op = "And", "Or"
>> 
>> Links for benchmarks tested are given below :-
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L728
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L764
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L728
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L764
>> 
>> Before this patch, the disassembly for one these testcases
>> (IntMaxVector.ROR) for Neon is shown below :
>>  ```
>>   ldr     q16, [x12, #16]
>>   and     v16.16b, v16.16b, v20.16b
>>   and     v16.16b, v16.16b, v20.16b
>>   add     x12, x16, x11
>>   sub     v17.4s, v21.4s, v16.4s
>>   ...
>>   ...
>> 
>> 
>> After this patch, the disassembly for the same testcase above is shown
>> below :
>> 
>>   ldr     q16, [x12, #16]
>>   and     v16.16b, v16.16b, v20.16b
>>   add     x12, x16, x11
>>   sub     v17.4s, v21.4s, v16.4s
>>   ...
>>   ...
>> 
>> 
>> The other tests also emit an extra "and" instruction as shown above for
>> the vector ROR/ROL operations.
>> 
>> Below are the performance results for the vectorapi rotate tests (tests
>> given in the links above) with this patch on aarch64 and x86_64 machines
>> (for int and long types) -
>> 
>> 
>> Benchmark                aarch64   x86_64
>> IntMaxVector.ROL         25.57%    26.09%
>> IntMaxVector.ROR         23.75%    24.15%
>> LongMaxVector.ROL        28.91%    28.51%
>> LongMaxVector.ROR        16.51%    29.11%
>> 
>> 
>> 
>> The percentage indicates the percent gain/improvement in performance
>> (ops/ms) with this patch over the master build without this patch. The
>> machine descriptions are given below -
>> aarch64 - 128-bit aarch64 machine
>> x86_64  - 256-bit x86 machine
>
> src/hotspot/share/opto/vectornode.cpp line 1907:
> 
>> 1905:     // (OperationV src1 (OperationV src1 src2 m1) m1) => OperationV(src1 src2 m1)
>> 1906:     } else if (n->is_predicated_vector() && n2->is_predicated_vector() &&
>> 1907:                n->in(3) == n2->in(3) && n->in(1) == n2->in(1)) {
> 
> I think this should be merged into line 1902. Why did you omit the `(OperationV src2 (OperationV src1 src2 m1) m1) => OperationV(src1 src2 m1)` case?

Hi, thank you for your review comments. I will make the suggested changes and upload a new patch.
Regarding omitting this condition - `(OperationV src2 (OperationV src1 src2 m1) m1)` => This would result in a different result as compared to `(OperationV src1 src2 m1)`. So the inner `OperationV` would result in `src1` being copied for unmasked lanes and the outer `OperationV` would result in `src2` being copied to the final result for unmasked lanes and thus the results for both the operations is different. So for masked lanes, the result is `OperationV(src1, src2)` but it is not so for the unmasked lanes. So we need to have the first arguments of both `OperationV` to be same so that the unmasked lanes would then have the same value in the end and that can then be optimized to a single `OperationV` node.

-------------

PR: https://git.openjdk.org/jdk/pull/10163