RFR: 8338023: Support two vector selectFrom API [v10]
Emanuel Peter
epeter at openjdk.org
Mon Sep 16 07:51:15 UTC 2024
On Mon, 16 Sep 2024 02:58:41 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:
>> Hi All,
>>
>> As per the discussion on panama-dev mailing list[1], patch adds the support for following new two vector permutation APIs.
>>
>>
>> Declaration:-
>> Vector<E>.selectFrom(Vector<E> v1, Vector<E> v2)
>>
>>
>> Semantics:-
>> Using index values stored in the lanes of "this" vector, assemble the values stored in first (v1) and second (v2) vector arguments. Thus, first and second vector serves as a table, whose elements are selected based on index value vector. API is applicable to all integral and floating-point types. The result of this operation is semantically equivalent to expression v1.rearrange(this.toShuffle(), v2). Values held in index vector lanes must lie within valid two vector index range [0, 2*VLEN) else an IndexOutOfBoundException is thrown.
>>
>> Summary of changes:
>> - Java side implementation of new selectFrom API.
>> - C2 compiler IR and inline expander changes.
>> - In absence of direct two vector permutation instruction in target ISA, a lowering transformation dismantles new IR into constituent IR supported by target platforms.
>> - Optimized x86 backend implementation for AVX512 and legacy target.
>> - Function tests covering new API.
>>
>> JMH micro included with this patch shows around 10-15x gain over existing rearrange API :-
>> Test System: Intel(R) Xeon(R) Platinum 8480+ [ Sapphire Rapids Server]
>>
>>
>> Benchmark (size) Mode Cnt Score Error Units
>> SelectFromBenchmark.rearrangeFromByteVector 1024 thrpt 2 2041.762 ops/ms
>> SelectFromBenchmark.rearrangeFromByteVector 2048 thrpt 2 1028.550 ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector 1024 thrpt 2 962.605 ops/ms
>> SelectFromBenchmark.rearrangeFromIntVector 2048 thrpt 2 479.004 ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector 1024 thrpt 2 359.758 ops/ms
>> SelectFromBenchmark.rearrangeFromLongVector 2048 thrpt 2 178.192 ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector 1024 thrpt 2 1463.459 ops/ms
>> SelectFromBenchmark.rearrangeFromShortVector 2048 thrpt 2 727.556 ops/ms
>> SelectFromBenchmark.selectFromByteVector 1024 thrpt 2 33254.830 ops/ms
>> SelectFromBenchmark.selectFromByteVector 2048 thrpt 2 17313.174 ops/ms
>> SelectFromBenchmark.selectFromIntVector 1024 thrpt 2 10756.804 ops/ms
>> S...
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
>
> Disabling VectorLoadShuffle bypassing optimization to comply with rearrange semantics at IR level.
Looks better, I still have a few comments.
src/hotspot/share/opto/vectorIntrinsics.cpp line 2739:
> 2737: return true;
> 2738: }
> 2739:
@jatin-bhateja You still have 3x `unbox failed v1` here. I already commented this earlier, and you resolved it and gave it a thumbs up 🙈
Can you please fix it now?
src/hotspot/share/opto/vectornode.cpp line 2116:
> 2114: const TypeVect* index_vect_type = index_vec->bottom_type()->is_vect();
> 2115: BasicType index_elem_bt = index_vect_type->element_basic_type();
> 2116: assert(!is_floating_point_type(index_elem_bt), "");
Why not verify this also in the constructor of `SelectFromTwoVectorNode`? Can you maybe explicitly verify what it must be rather than **not** be?
src/hotspot/share/opto/vectornode.cpp line 2122:
> 2120: // index format by subsequent VectorLoadShuffle.
> 2121: int cast_vopc = VectorCastNode::opcode(0, index_elem_bt, true);
> 2122: Node* index_byte_vec = phase->transform(VectorCastNode::make(cast_vopc, index_vec, T_BYTE, num_elem));
This cast assumes that the indices cannot have more than 8 bits. This would allow vector lengths of up to 256. This is fine for intel. But as far as I know ARM has in principle longer vectors - up to 2048 bytes. Should we maybe add some assert here to make sure we never badly truncate the index?
src/hotspot/share/opto/vectornode.cpp line 2138:
> 2136:
> 2137: // Load indexes from byte vector and appropriatly massage them to target specific
> 2138: // permutation index format.
I would replace `massage` -> `transform` everywhere.
src/hotspot/share/opto/vectornode.hpp line 1625:
> 1623: Node* Ideal(PhaseGVN* phase, bool can_reshape);
> 1624: virtual int Opcode() const;
> 1625: };
`index` -> `indexes` because this is a vector, right? Otherwise I'll assume it is a scalar.
Can you do some pseudo-code, that says how exactly the indices are interpreted? What if they are out of bounds? Does it wrap? Or assume they are in bounds? Undefined behaviour?
-------------
Changes requested by epeter (Reviewer).
PR Review: https://git.openjdk.org/jdk/pull/20508#pullrequestreview-2305905569
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760651336
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760674461
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760678107
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760680772
PR Review Comment: https://git.openjdk.org/jdk/pull/20508#discussion_r1760665944
More information about the core-libs-dev
mailing list