RFR: 8318650: Optimized subword gather for x86 targets. [v15]

Emanuel Peter epeter at openjdk.org
Mon Feb 26 13:49:57 UTC 2024


On Mon, 26 Feb 2024 13:14:24 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions

Reposting link to a conversation that is marked "resolved":
https://github.com/openjdk/jdk/pull/16354#discussion_r1502617942

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1672:

> 1670:                                         XMMRegister xtmp3, Register rtmp,
> 1671:                                         Register mask_idx, Register length,
> 1672:                                         int vector_len, int vlen_enc) {

> These are temporary variable and appropriately named.

I disagree. Names should be descriptive, and not numbered.

At least some of your registers here have only a single use. Those could be named:
`xtmp_xxxx` instead of `xtmp1..3`.

If for some reason you don't want to do that, then say why, and at least add a comment that helps the reader have a mental map from (meaningless) register names to their meaning. I really don't want to have to reverse-engineer things in a review if it can be avoided.

-------------

PR Comment: https://git.openjdk.org/jdk/pull/16354#issuecomment-1964188004
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1502640887


More information about the hotspot-compiler-dev mailing list