RFR: 8318650: Optimized subword gather for x86 targets. [v13]

Tue Feb 20 08:53:02 UTC 2024

On Wed, 7 Feb 2024 18:38:29 GMT, Jatin Bhateja <jbhateja at openjdk.org> wrote:

>> Hi All,
>> 
>> This patch optimizes sub-word gather operation for x86 targets with AVX2 and AVX512 features.
>> 
>> Following is the summary of changes:-
>> 
>> 1) Intrinsify sub-word gather using hybrid algorithm which initially partially unrolls scalar loop to accumulates values from gather indices into a quadword(64bit) slice followed by vector permutation to place the slice into appropriate vector lanes, it prevents code bloating and generates compact JIT sequence. This coupled with savings from expansive array allocation in existing java implementation translates into significant performance of 1.5-10x gains with included micro.
>> 
>> ![image](https://github.com/openjdk/jdk/assets/59989778/e25ba4ad-6a61-42fa-9566-452f741a9c6d)
>> 
>> 
>> 2) Patch was also compared against modified java fallback implementation by replacing temporary array allocation with zero initialized vector and a scalar loops which inserts gathered values into vector. But, vector insert operation in higher vector lanes is a three step process which first extracts the upper vector 128 bit lane, updates it with gather subword value and then inserts the lane back to its original position. This makes inserts into higher order lanes costly w.r.t to proposed solution. In addition generated JIT code for modified fallback implementation was very bulky. This may impact in-lining decisions into caller contexts.
>> 
>> Kindly review and share your feedback.
>> 
>> Best Regards,
>> Jatin
>
> Jatin Bhateja has updated the pull request incrementally with one additional commit since the last revision:
> 
>   Review comments resolutions.

Thanks for the email-ping!
A really nice feature :)
I left some more requests, comments and questions.

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1582:

> 1580:   if (elem_bt == T_SHORT) {
> 1581:     Label case0, case1, case2, case3;
> 1582:     Label *larr[] = {&case0, &case1, &case2, &case3};

Suggestion:

    Label* larr[] = {&case0, &case1, &case2, &case3};

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1584:

> 1582:     Label *larr[] = {&case0, &case1, &case2, &case3};
> 1583:     for (int i = 0; i < 4; i++) {
> 1584:       // dst[i] = mask ? src[index[i]] : 0

I like these comments a lot!
They would be even better if they had a more clear relation to the register names.

`dst[i] = mask[i+midx] ? src[idx_base[i]] : 0`
It would then be helpful to know the types of these arrays.
It seems that `idx_base` is an array with type int, right?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1585:

> 1583:     for (int i = 0; i < 4; i++) {
> 1584:       // dst[i] = mask ? src[index[i]] : 0
> 1585:       btq(mask, midx);

Do I see it right, that `midx` selects what bit is selected? Why this name? Can we have something more descriptive?
I know it seems to be generally the style in the AD files to not have comments and use non-descriptive register names. But that really makes the code hard to read, I basically need to reverse-engineer everything, and figure out all invariants, preconditions, postconditions etc myself. That makes reviewing hard.
With so many register arguments: how am I supposed to know which registers have what meaning, which ones are modified, etc?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1595:

> 1593:     assert(elem_bt == T_BYTE, "");
> 1594:     Label case0, case1, case2, case3, case4, case5, case6, case7;
> 1595:     Label *larr[] = {&case0, &case1, &case2, &case3,

Suggestion:

    Label* larr[] = {&case0, &case1, &case2, &case3,

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1604:

> 1602:       pinsrb(dst, Address(base, rtmp), i);
> 1603:       bind(*larr[i]);
> 1604:       incq(midx);

I don't know much about labels. But why can they not be local to the scope of the loop?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1624:

> 1622:       jccb(Assembler::carryClear, *larr[i]);
> 1623:       movl(rtmp, Address(idx_base, i * 4));
> 1624:       addl(rtmp, offset);

This is really the only line that is additional to the code in `vgather8b_masked`. Why not just put it under a `bool has_offset` flag, and then you ran just reuse this code here from `vgather8b_masked`, and reduce code duplication?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1694:

> 1692:  * Gather loop first packs 4 short / 8 byte values from gather indices
> 1693:  * into quadword lane and then permutes quadword lane into appropriate
> 1694:  * location in destination vector. Following pseudo code describes the

I found this a bit confusing. Would this be another way of saying this:
`We divide the gather operations into quadwords of 8 bytes (or 4 shorts). Each quadword is gathered using the vgather8b instruction. We combine the quadwords into the destination register via OR and shifting (i.e. rotating / permuting) the destination register, until every quadword is in the desired location.`

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1695:

> 1693:  * into quadword lane and then permutes quadword lane into appropriate
> 1694:  * location in destination vector. Following pseudo code describes the
> 1695:  * algorithm in detail:-

Suggestion:

 * algorithm in detail:

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1707:

> 1705:  *
> 1706:  * With each iteration, doubleword permute indices (0,1) corresponding
> 1707:  * to assembled quadword gets right shifted by two lane position.

Suggestion:

 * With each iteration, doubleword permute indices (0,1) corresponding to gathered
 * quadword gets right shifted by two lane position (i.e. 8 bytes).

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1716:

> 1714:                                         XMMRegister xtmp3, Register rtmp,
> 1715:                                         Register midx, Register length,
> 1716:                                         int vector_len, int vlen_enc) {

I would like to see more descriptive names, where I don't have to reverse-engineer their meaning.
What are the pre/post-conditions on `midx`?

src/hotspot/cpu/x86/c2_MacroAssembler_x86.cpp line 1723:

> 1721:   vpxor(dst, dst, dst, vlen_enc);
> 1722:   vallones(xtmp2, vlen_enc);
> 1723:   vpsubd(xtmp2, xtmp1, xtmp2 ,vlen_enc);

Suggestion:

  vpsubd(xtmp2, xtmp1, xtmp2, vlen_enc);

src/hotspot/cpu/x86/x86.ad line 4111:

> 4109: %}
> 4110: 
> 4111: instruct vgather_subwordLE8B(vec dst, memory mem, rRegP idx, immI_0 offset, rRegP tmp, rRegI rtmp) %{

Suggestion:

instruct vgather_subwordLE8B(vec dst, memory mem, rRegP base_index, immI_0 offset, rRegP tmp, rRegI rtmp) %{

For consistency

src/hotspot/cpu/x86/x86.ad line 4120:

> 4118:     BasicType elem_bt = Matcher::vector_element_basic_type(this);
> 4119:     __ lea($tmp$$Register, $mem$$Address);
> 4120:     __ vgather8b(elem_bt, $dst$$XMMRegister, $tmp$$Register, $idx$$Register, $rtmp$$Register, vlen_enc);

The `LE8B` and `Matcher::vector_length_in_bytes(n) <= 8` suggest we can perform this with 4 bytes as well.
Is that correct?
Would that not lead to issues, when we are then reading `base_index` at bytes 4...7, which possibly have garbage, and then use that to gather?
Do we have tests for that?

src/hotspot/cpu/x86/x86.ad line 4143:

> 4141: %}
> 4142: 
> 4143: instruct vgather_subwordLE8B_off(vec dst, memory mem, rRegP idx, rRegI offset, rRegP tmp, rRegI rtmp, rFlagsReg cr) %{

Suggestion:

instruct vgather_subwordLE8B_offset(vec dst, memory mem, rRegP idx, rRegI offset, rRegP tmp, rRegI rtmp, rFlagsReg cr) %{

src/hotspot/cpu/x86/x86.ad line 4147:

> 4145:   match(Set dst (LoadVectorGather mem (Binary idx offset)));
> 4146:   effect(TEMP tmp, TEMP rtmp, KILL cr);
> 4147:   format %{ "vector_gatherLE8 $dst, $mem, $idx, $offset\t! using $tmp and $rtmp as TEMP" %}

Suggestion:

  format %{ "vector_gatherLE8_offset $dst, $mem, $idx, $offset\t! using $tmp and $rtmp as TEMP" %}

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ByteVector.java line 3071:

> 3069:                 .fromArray(lsp, indexMap, mapOffset + i)
> 3070:                 .add(offset);
> 3071:             vix = VectorIntrinsics.checkIndex(vix, a.length);

are you using the `vix` after this assignment?

src/jdk.incubator.vector/share/classes/jdk/incubator/vector/ShortVector.java line 3809:

> 3807: 
> 3808:         // Check indices are within array bounds.
> 3809:         // FIXME: Check index under mask controlling.

Did you mean to leave a FIXME? If so, please reference a JIRA bug number where you intend to fix it.

-------------

Changes requested by epeter (Reviewer).

PR Review: https://git.openjdk.org/jdk/pull/16354#pullrequestreview-1889647285
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495342140
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495355989
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495351192
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495342349
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495341931
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495361896
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495400579
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495366111
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495405212
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495414923
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495415770
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495418305
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495424410
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495426048
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495426552
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495434432
PR Review Comment: https://git.openjdk.org/jdk/pull/16354#discussion_r1495440107