RFR: 8350988: Consolidate Identity of self-inverse operations

Mon Mar 3 14:06:02 UTC 2025

On Mon, 3 Mar 2025 09:20:11 GMT, Damon Fenacci <dfenacci at openjdk.org> wrote:

> I'm not totally sure I fully get what you mean here: does this optimization hinder vectorization in some cases? Does this result in a slowdown? (BTW do you have benchmark results?) Should we possibly try to detect this early and avoid simplifying?

What happens basically comes down to this check: https://github.com/openjdk/jdk/blob/885338b5f38ed05d8b91efc0178b371f2f89310e/src/hotspot/share/opto/superword.cpp#L1759
Without my change, `_num_work_vecs` is 3 (I assume; I didn't debug that part), as we have one load and two reverse bytes operations. `_num_reductions` is 1, the xor. With my change, when we reach this check, `_num_work_vecs` is 1 (that part I checked with the debugger), as only the load is left. So superword does not consider vectorization to be profitable.
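
For readers without the gist at hand, the loop shape in question can be sketched roughly as below. This is only an illustration, assuming an `int[]` input and an xor reduction; the class name, method bodies and element type are hypothetical and not copied from the benchmark. `doubleReverse` has one load, two reverse bytes operations and the xor reduction, while `folded` is what remains once the two self-inverse calls are folded into an identity: just the load and the xor.

```java
// Hypothetical sketch of the pattern discussed above; not the actual benchmark code.
class DoubledReverseBytesSketch {
    // One load, two reverseBytes operations, and an xor reduction per element.
    static int doubleReverse(int[] data) {
        int acc = 0;
        for (int i = 0; i < data.length; i++) {
            // reverseBytes is self-inverse, so the nested pair is an identity.
            acc ^= Integer.reverseBytes(Integer.reverseBytes(data[i]));
        }
        return acc;
    }

    // The same loop after folding: only the load and the xor reduction remain.
    static int folded(int[] data) {
        int acc = 0;
        for (int i = 0; i < data.length; i++) {
            acc ^= data[i];
        }
        return acc;
    }
}
```

Both methods compute the same value; the difference discussed here is only in how much vectorizable work the loop body presents to superword.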

My benchmark code: https://gist.github.com/SirYwell/a76578dc5f3c10cd08b768a3bd39a988
Results on my machine (Ryzen 9 3900X):
Mainline:

Benchmark                           Mode  Cnt     Score     Error   Units
DoubledReverseBytes.doubleReverse  thrpt    3  3287,042 ± 398,656  ops/ms
DoubledReverseBytes.folded         thrpt    3   418,627 ±  20,797  ops/ms

This PR:

Benchmark                           Mode  Cnt    Score    Error   Units
DoubledReverseBytes.doubleReverse  thrpt    3  419,369 ± 24,974  ops/ms
DoubledReverseBytes.folded         thrpt    3  415,469 ± 88,714  ops/ms

You can see the almost 8x speedup from vectorization for `doubleReverse` on mainline (roughly 3287 vs. 419 ops/ms), which no longer happens with my change.

I don't think this should block this change. Detecting such situations also seems like a rather complicated workaround.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/23851#issuecomment-2694504151