RFR: 8292675: Add identity transformation for removing redundant AndV/OrV nodes
Bhavana Kilambi
bkilambi at openjdk.org
Thu Sep 8 11:41:54 UTC 2022
On Thu, 8 Sep 2022 08:19:57 GMT, Tobias Hartmann <thartmann at openjdk.org> wrote:
>> Recently we found that the rotate left/right benchmarks with vectorapi
>> emit a redundant "and" instruction on both aarch64 and x86_64 machines
>> which can be done away with. For example - and(and(a, b), b) generates
>> two "and" instructions which can be reduced to a single "and" operation-
>> and(a, b) since "and" (and "or") operations are commutative and
>> idempotent in nature. This can help improve performance for all those
>> workloads which have multiple "and"/"or" operations with the same value
>> by reducing them to fewer "and"/"or" operations accordingly.
>>
>> This patch adds the following transformations for vector logical
>> operations - AndV and OrV :
>>
>>
>> (OpV (OpV a b) b) => (OpV a b)
>> (OpV (OpV a b) a) => (OpV a b)
>> (OpV (OpV a b m1) b m1) => (OpV a b m1)
>> (OpV (OpV a b m1) a m1) => (OpV a b m1)
>> (OpV a (OpV a b)) => (OpV a b)
>> (OpV b (OpV a b)) => (OpV a b)
>> (OpV a (OpV a b m) m) => (OpV a b m)
>>
>> where Op = "And", "Or"
>>
>> Links for benchmarks tested are given below :-
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L728
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/IntMaxVector.java#L764
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L728
>> https://github.com/openjdk/panama-vector/blob/2aade73adeabdf6a924136b17fd96ccc95c1d160/test/micro/org/openjdk/bench/jdk/incubator/vector/operation/LongMaxVector.java#L764
>>
>> Before this patch, the disassembly for one these testcases
>> (IntMaxVector.ROR) for Neon is shown below :
>> ```
>> ldr q16, [x12, #16]
>> and v16.16b, v16.16b, v20.16b
>> and v16.16b, v16.16b, v20.16b
>> add x12, x16, x11
>> sub v17.4s, v21.4s, v16.4s
>> ...
>> ...
>>
>>
>> After this patch, the disassembly for the same testcase above is shown
>> below :
>>
>> ldr q16, [x12, #16]
>> and v16.16b, v16.16b, v20.16b
>> add x12, x16, x11
>> sub v17.4s, v21.4s, v16.4s
>> ...
>> ...
>>
>>
>> The other tests also emit an extra "and" instruction as shown above for
>> the vector ROR/ROL operations.
>>
>> Below are the performance results for the vectorapi rotate tests (tests
>> given in the links above) with this patch on aarch64 and x86_64 machines
>> (for int and long types) -
>>
>>
>> Benchmark aarch64 x86_64
>> IntMaxVector.ROL 25.57% 26.09%
>> IntMaxVector.ROR 23.75% 24.15%
>> LongMaxVector.ROL 28.91% 28.51%
>> LongMaxVector.ROR 16.51% 29.11%
>>
>>
>>
>> The percentage indicates the percent gain/improvement in performance
>> (ops/ms) with this patch over the master build without this patch. The
>> machine descriptions are given below -
>> aarch64 - 128-bit aarch64 machine
>> x86_64 - 256-bit x86 machine
>
> src/hotspot/share/opto/vectornode.cpp line 1907:
>
>> 1905: // (OperationV src1 (OperationV src1 src2 m1) m1) => OperationV(src1 src2 m1)
>> 1906: } else if (n->is_predicated_vector() && n2->is_predicated_vector() &&
>> 1907: n->in(3) == n2->in(3) && n->in(1) == n2->in(1)) {
>
> I think this should be merged into line 1902. Why did you omit the `(OperationV src2 (OperationV src1 src2 m1) m1) => OperationV(src1 src2 m1)` case?
Hi, thank you for your review comments. I will make the suggested changes and upload a new patch.
Regarding omitting this condition - `(OperationV src2 (OperationV src1 src2 m1) m1)` => This would result in a different result as compared to `(OperationV src1 src2 m1)`. So the inner `OperationV` would result in `src1` being copied for unmasked lanes and the outer `OperationV` would result in `src2` being copied to the final result for unmasked lanes and thus the results for both the operations is different. So for masked lanes, the result is `OperationV(src1, src2)` but it is not so for the unmasked lanes. So we need to have the first arguments of both `OperationV` to be same so that the unmasked lanes would then have the same value in the end and that can then be optimized to a single `OperationV` node.
-------------
PR: https://git.openjdk.org/jdk/pull/10163
More information about the hotspot-compiler-dev
mailing list