RFR: 8322174: RISC-V: C2 VectorizedHashCode RVV Version [v8]
Robbin Ehn
rehn at openjdk.org
Tue May 6 17:19:19 UTC 2025
On Mon, 5 May 2025 10:17:27 GMT, Yuri Gaevsky <duke at openjdk.org> wrote:
>> The patch adds possibility to use RVV instructions for faster vectorizedHashCode calculations on RVV v1.0.0 capable hardware.
>>
>> Testing: hotspot/jtreg/compiler/ under QEMU-8.1 with RVV v1.0.0.
>
> Yuri Gaevsky has updated the pull request incrementally with one additional commit since the last revision:
>
> change slli+add sequence to shadd
Hey, I'm sorry for not explaining this properly; maybe this helps:
You have four coefficients, so you want to process a batch of four, _OR_ a multiple of four.
We call such a batch of four a lane:
int lane = array[currentIndex] * m_pow_3 + array[currentIndex + 1] * m_pow_2 + array[currentIndex + 2] * m_pow_1 + array[currentIndex + 3] * m_pow_0;
hashCode = hashCode * m_pow_4 + lane;
You can process multiple lanes by doing:
int lane_1 = array[currentIndex ] * m_pow_3 + array[currentIndex + 1] * m_pow_2 + array[currentIndex + 2] * m_pow_1 + array[currentIndex + 3] * m_pow_0;
int lane_2 = array[currentIndex+4] * m_pow_3 + array[currentIndex + 5] * m_pow_2 + array[currentIndex + 6] * m_pow_1 + array[currentIndex + 7] * m_pow_0;
hashCode = hashCode * m_pow_4 + lane_1;
hashCode = hashCode * m_pow_4 + lane_2;
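To make the lane arithmetic above concrete, here is a minimal scalar sketch in Java. The class name `LaneHash`, the method name `hashByLanes`, and the choice of multiplier 31 (the one used by `java.util.Arrays.hashCode`) are assumptions for illustration; they are not taken from the patch.

```java
// Illustrative sketch only: LaneHash and hashByLanes are hypothetical names,
// and M = 31 is assumed because it is the multiplier of Arrays.hashCode.
final class LaneHash {
    static final int M = 31;
    static final int M_POW_0 = 1;
    static final int M_POW_1 = M;
    static final int M_POW_2 = M * M;
    static final int M_POW_3 = M * M * M;
    static final int M_POW_4 = M * M * M * M;

    // Folds the array into the hash one lane (four elements) at a time.
    static int hashByLanes(int[] array, int initial) {
        int hashCode = initial;
        int currentIndex = 0;
        // Full lanes: each lane is reduced to one value, then folded in.
        for (; currentIndex + 4 <= array.length; currentIndex += 4) {
            int lane = array[currentIndex] * M_POW_3
                     + array[currentIndex + 1] * M_POW_2
                     + array[currentIndex + 2] * M_POW_1
                     + array[currentIndex + 3] * M_POW_0;
            hashCode = hashCode * M_POW_4 + lane;
        }
        // Scalar tail: the 0-3 elements that do not fill a lane.
        for (; currentIndex < array.length; currentIndex++) {
            hashCode = hashCode * M + array[currentIndex];
        }
        return hashCode;
    }
}
```

With an initial value of 1, `LaneHash.hashByLanes(a, 1)` should agree with `java.util.Arrays.hashCode(a)`, because Java `int` arithmetic wraps and the lane form is just four steps of `h = 31 * h + x` expanded.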
So, for example, you could lay out the data as below using vlse32.v (a strided load):
v2 = array[currentIndex] | array[currentIndex+4] | .... | array[currentIndex+(n-1)*4]
v4 = array[currentIndex+1] | array[currentIndex+5] | .... | array[currentIndex+1+(n-1)*4]
v6 = array[currentIndex+2] | array[currentIndex+6] | .... | array[currentIndex+2+(n-1)*4]
v8 = array[currentIndex+3] | array[currentIndex+7] | .... | array[currentIndex+3+(n-1)*4]
v10 = sum lane 1 | sum lane 2 | .... | sum lane n
Now you can multiply every element in v2 by m_pow_3 without knowing the length of v2 (i.e. LMUL can be 1 or 8), and likewise v4 by m_pow_2, v6 by m_pow_1 and v8 by m_pow_0.
Then sum each lane into v10, and finally, for each lane, multiply hashCode by m_pow_4 and add that lane sum.
When this is done, you have 0-3 elements left, which you can process with scalar code.
So when you do:
`vsetvli vl_processing, count/4, emul, lmul`
vl_processing == number of lanes. There is no need to know the length of the vector registers.
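The strided layout and the vsetvli step can be modelled in plain Java to check that the result does not depend on the vector length. This is only a sketch under assumptions: `StridedLaneHash`, `setVl` and the `vlmax` parameter are hypothetical names, `setVl` models vsetvli simply as min(requested, vlmax) (real hardware is allowed to grant other values in some cases), the arrays `v2`..`v10` stand in for vector register groups, and the multiplier 31 is assumed.

```java
// Illustrative model only: all names here are hypothetical, and arrays stand
// in for vector registers. vlmax plays the role of the hardware-dependent
// maximum number of 32-bit elements per register group.
final class StridedLaneHash {
    static final int M = 31;
    static final int M_POW_1 = M;
    static final int M_POW_2 = M * M;
    static final int M_POW_3 = M * M * M;
    static final int M_POW_4 = M * M * M * M;

    // Simplified stand-in for vsetvli: grant at most vlmax lanes.
    static int setVl(int requestedLanes, int vlmax) {
        return Math.min(requestedLanes, vlmax);
    }

    static int hash(int[] array, int initial, int vlmax) {
        if (vlmax < 1) {
            throw new IllegalArgumentException("vlmax must be at least 1");
        }
        int hashCode = initial;
        int currentIndex = 0;
        int lanesLeft = array.length / 4;
        while (lanesLeft > 0) {
            int vl = setVl(lanesLeft, vlmax); // number of lanes this round
            int[] v2 = new int[vl], v4 = new int[vl], v6 = new int[vl];
            int[] v8 = new int[vl], v10 = new int[vl];
            // Four strided loads, each with a stride of four elements.
            for (int i = 0; i < vl; i++) {
                v2[i] = array[currentIndex + 4 * i];
                v4[i] = array[currentIndex + 4 * i + 1];
                v6[i] = array[currentIndex + 4 * i + 2];
                v8[i] = array[currentIndex + 4 * i + 3];
            }
            // Element-wise multiply by the coefficients and sum per lane.
            for (int i = 0; i < vl; i++) {
                v10[i] = v2[i] * M_POW_3 + v4[i] * M_POW_2
                       + v6[i] * M_POW_1 + v8[i];
            }
            // Fold the lane sums into the hash, in lane order.
            for (int i = 0; i < vl; i++) {
                hashCode = hashCode * M_POW_4 + v10[i];
            }
            currentIndex += 4 * vl;
            lanesLeft -= vl;
        }
        // Scalar tail: 0-3 remaining elements.
        for (; currentIndex < array.length; currentIndex++) {
            hashCode = hashCode * M + array[currentIndex];
        }
        return hashCode;
    }
}
```

Calling `StridedLaneHash.hash(a, 1, vlmax)` with different `vlmax` values should give the same result each time, which is the vector length agnostic property: the loop only ever asks how many lanes it was granted.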
NOTE: I'm not saying this is better or faster than your version - it's hopefully an example of a vector length agnostic approach.
-------------
PR Comment: https://git.openjdk.org/jdk/pull/17413#issuecomment-2855340992