RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v8]

Tue May 6 17:19:19 UTC 2025

On Mon, 5 May 2025 10:17:27 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:

>> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware.
>> 
>> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0.
>
> Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision:
> 
>   change slli+add sequence to shadd

Hey, I'm sorry for not explaining this properly; maybe this helps:

You have four coefficients, so you want to process a batch of four elements, _OR_ a multiple of four.
We call such a batch of four a lane:

            int lane = array[currentIndex] * m_pow_3 + array[currentIndex + 1] * m_pow_2 + array[currentIndex + 2] * m_pow_1 + array[currentIndex + 3] * m_pow_0;
            hashCode = hashCode * m_pow_4 + lane;

You can process multiple lanes by doing:

            int lane_1 = array[currentIndex  ] * m_pow_3 + array[currentIndex + 1] * m_pow_2 + array[currentIndex + 2] * m_pow_1 + array[currentIndex + 3] * m_pow_0;
            int lane_2 = array[currentIndex+4] * m_pow_3 + array[currentIndex + 5] * m_pow_2 + array[currentIndex + 6] * m_pow_1 + array[currentIndex + 7] * m_pow_0;
            hashCode = hashCode * m_pow_4 + lane_1;
            hashCode = hashCode * m_pow_4 + lane_2;
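
To sanity-check the lane idea, here is a minimal scalar sketch in Java. It assumes the multiplier is 31 and the initial value is 1, as in `java.util.Arrays.hashCode(int[])`; the class and method names (`LaneHashSketch`, `laneHash`) are made up for illustration and are not part of the patch.

```java
import java.util.Arrays;

final class LaneHashSketch {
    // Powers of the assumed multiplier 31, named after the m_pow_* values above.
    static final int M_POW_0 = 1;
    static final int M_POW_1 = 31;
    static final int M_POW_2 = 31 * 31;
    static final int M_POW_3 = 31 * 31 * 31;
    static final int M_POW_4 = 31 * 31 * 31 * 31;

    static int laneHash(int[] array) {
        int hashCode = 1; // same seed as Arrays.hashCode(int[])
        int currentIndex = 0;
        // Process full lanes of four elements.
        for (; currentIndex + 4 <= array.length; currentIndex += 4) {
            int lane = array[currentIndex] * M_POW_3
                     + array[currentIndex + 1] * M_POW_2
                     + array[currentIndex + 2] * M_POW_1
                     + array[currentIndex + 3] * M_POW_0;
            hashCode = hashCode * M_POW_4 + lane;
        }
        // Scalar tail: the 0-3 elements that do not fill a lane.
        for (; currentIndex < array.length; currentIndex++) {
            hashCode = hashCode * M_POW_1 + array[currentIndex];
        }
        return hashCode;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5};
        // Both sides use wrapping int arithmetic, so they should agree.
        System.out.println(laneHash(data) == Arrays.hashCode(data));
    }
}
```

The lane form is just the usual `31 * h + a[i]` recurrence unrolled four times, which is why the order in which lanes are folded into hashCode matters.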

So, for example, you could lay out the data as below using vlse32.v (a strided load):

v2  = array[currentIndex]   | array[currentIndex+4] | .... | array[currentIndex+(n-1)*4]
v4  = array[currentIndex+1] | array[currentIndex+5] | .... | array[currentIndex+1+(n-1)*4]
v6  = array[currentIndex+2] | array[currentIndex+6] | .... | array[currentIndex+2+(n-1)*4]
v8  = array[currentIndex+3] | array[currentIndex+7] | .... | array[currentIndex+3+(n-1)*4]
v10 = sum lane 1            | sum lane 2            | .... | sum lane n

Now you can multiply every element in v2 by m_pow_3 without knowing the length of v2 (i.e. LMUL can be 1 or 8), and likewise v4, v6 and v8 by m_pow_2, m_pow_1 and m_pow_0.
Then sum each lane into v10, and finally, for each lane in order, multiply hashCode by m_pow_4 and add that lane's sum.

When this is done, you have 0-3 elements left, which you can process with scalar code.

So when you do:
`vsetvli vl_processing, count/4, emul, lmul`
vl_processing == number of lanes. There is no need to know the length of the vector registers.
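
To make the vector-length-agnostic part concrete, here is a rough scalar model in Java; it is a sketch under stated assumptions, not RVV code. The `maxLanes` parameter is a made-up stand-in for whatever the hardware allows per iteration, `Math.min(lanesLeft, maxLanes)` is a simplified model of what vsetvli returns, and the `laneSum` array plays the role of v10. The multiplier 31 and seed 1 are assumed from `java.util.Arrays.hashCode(int[])`.

```java
import java.util.Arrays;

final class VlaHashSketch {
    static final int M_POW_1 = 31; // assumed multiplier
    static final int M_POW_2 = 31 * 31;
    static final int M_POW_3 = 31 * 31 * 31;
    static final int M_POW_4 = 31 * 31 * 31 * 31;

    // maxLanes (>= 1) is hypothetical: it models the per-iteration hardware limit.
    static int hash(int[] array, int maxLanes) {
        int hashCode = 1;
        int currentIndex = 0;
        int lanesLeft = array.length / 4;   // the "count/4" passed to vsetvli
        int[] laneSum = new int[maxLanes];  // models v10
        while (lanesLeft > 0) {
            // Simplified model of vsetvli: number of lanes handled this round.
            int vl = Math.min(lanesLeft, maxLanes);
            // Models the four strided loads, the multiplies and the per-lane sum.
            for (int lane = 0; lane < vl; lane++) {
                int base = currentIndex + 4 * lane;
                laneSum[lane] = array[base] * M_POW_3
                              + array[base + 1] * M_POW_2
                              + array[base + 2] * M_POW_1
                              + array[base + 3];
            }
            // Fold the lane sums into hashCode in order.
            for (int lane = 0; lane < vl; lane++) {
                hashCode = hashCode * M_POW_4 + laneSum[lane];
            }
            currentIndex += 4 * vl;
            lanesLeft -= vl;
        }
        // Scalar tail for the remaining 0-3 elements.
        for (; currentIndex < array.length; currentIndex++) {
            hashCode = hashCode * M_POW_1 + array[currentIndex];
        }
        return hashCode;
    }

    public static void main(String[] args) {
        int[] data = new int[23];
        for (int i = 0; i < data.length; i++) data[i] = i * 7 - 50;
        // The result should not depend on how many lanes fit per iteration.
        for (int maxLanes : new int[] {1, 2, 4, 8, 64}) {
            System.out.println(hash(data, maxLanes) == Arrays.hashCode(data));
        }
    }
}
```

The loop body never refers to the register width itself, only to `vl`, which is the point of the layout above.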

NOTE: I'm not saying this is better or faster than your version - it's hopefully an example of a vector length agnostic approach.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-2855340992