[vector] Sparse load (simple gather) of vector from array?

Wed Mar 20 23:07:59 UTC 2019

Hi Lev,

>   I have a (long) array of floats: `float[] data`. I have a Species of
> FloatArray, `S`. I need to create a `FloatVector` with even elements (0,
> 2, 4, ...) set to elements from the `data` array and odd elements set to
> zero (`0.0f`).
> 
>    So, if we use a 256-bit vector shape, it should be like this:
> 
> {data[offset+0], 0.0f, data[offset+1], 0.0f, data[offset+2], 0.0f,
> data[offset+3], 0.0f}
> 
>    I could load vector from array and blend it with zero vector, but it
> will have wrong float data from array (like `offset+0`, `offset+2`, etc).
> 
>   What is the simplest way to perform such a "gather" load?

Strictly speaking, it's not a "gather" operation - the input data being 
accessed is still laid out contiguously in memory.

You can achieve your goal by adapting the vector contents after they are 
loaded from memory. At the API level that means reshape + rearrange + 
blend, and I see it as the idiomatic way to implement such a transformation:

(d_i = data[offset+i])

{d_0, d_1, d_2, d_3}                     // Float128Vector

   =reshape=>

{d_0, d_1, d_2, d_3,   0,   0,   0,   0} // Float256Vector

{  0,   0,   1,   1,   2,   2,   3,   3} // Float256Shuffle

   =rearrange=>

{d_0, d_0, d_1, d_1, d_2, d_2, d_3, d_3} // Float256Vector

{  0,   0,   0,   0,   0,   0,   0,   0} // Float256Vector
{  T,   F,   T,   F,   T,   F,   T,   F} // Float256Mask

   =blend=>

{d_0,   0, d_1,   0, d_2,   0, d_3,   0} // Float256Vector
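
To make the lane movement concrete, here is a minimal scalar sketch of the 
three steps above. It uses plain Java arrays rather than the Vector API, 
and the class and method names (`EvenLaneSpread`, `reshape`, `rearrange`, 
`blend`, `spread`) are hypothetical helpers chosen to mirror the diagram, 
not actual API signatures:

```java
import java.util.Arrays;

// Scalar model of reshape + rearrange + blend (plain arrays, NOT the Vector API).
class EvenLaneSpread {

    // reshape: copy the lanes into a wider vector; the added upper lanes are zero.
    static float[] reshape(float[] v, int newLength) {
        return Arrays.copyOf(v, newLength);
    }

    // rearrange: result lane i takes source lane shuffle[i].
    static float[] rearrange(float[] v, int[] shuffle) {
        float[] r = new float[v.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = v[shuffle[i]];
        }
        return r;
    }

    // blend: lane i comes from onTrue where mask[i] is set, otherwise from onFalse.
    static float[] blend(float[] onFalse, float[] onTrue, boolean[] mask) {
        float[] r = new float[mask.length];
        for (int i = 0; i < r.length; i++) {
            r[i] = mask[i] ? onTrue[i] : onFalse[i];
        }
        return r;
    }

    // Produces {d_0, 0, d_1, 0, d_2, 0, d_3, 0} with d_i = data[offset+i].
    static float[] spread(float[] data, int offset) {
        float[] narrow = Arrays.copyOfRange(data, offset, offset + 4); // 4 lanes
        float[] wide = reshape(narrow, 8);                             // 8 lanes
        int[] shuffle = {0, 0, 1, 1, 2, 2, 3, 3};
        float[] duplicated = rearrange(wide, shuffle);
        float[] zero = new float[8];
        boolean[] mask = {true, false, true, false, true, false, true, false};
        return blend(zero, duplicated, mask);
    }

    public static void main(String[] args) {
        float[] data = {1f, 2f, 3f, 4f};
        System.out.println(Arrays.toString(spread(data, 0)));
    }
}
```

The model only illustrates which lane ends up where; it says nothing about 
the code the JIT generates for the real vector operations.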

Depending on the platform, there may be a more efficient way available 
to implement it. For example, on x86 it's possible to implement a 
similar transformation using a single instruction:

   * VPSHUFB ymm1, ymm2, ymm3/m256
   * VPSHUFD ymm1, ymm2/m256, imm8

The current vision is that it should be the back end's responsibility to 
choose the best implementation, but the x86 back end isn't powerful enough 
yet to perform such optimizations.

For example, rearrange + blend can be fused into a single VPSHUFB by 
setting the MSB of the shuffle control byte for each element being zeroed, 
and reshape becomes a no-op when the upper part of the source register is 
known to be zero (e.g., VEX/EVEX-encoded vector `mov`s zero the remaining 
part of the vector register).
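
As an illustration of the MSB trick, here is a small scalar model of how 
(V)PSHUFB treats one 128-bit lane: an output byte is zeroed when the MSB of 
its control byte is set, otherwise it is copied from the source byte 
selected by the low 4 bits. The class `PshufbModel` and its methods are 
hypothetical names for this sketch. Note that the 256-bit form shuffles 
each 128-bit lane independently, so the model covers a single lane only 
(d_0 and d_1 in the example above):

```java
// Scalar model of one 128-bit lane of (V)PSHUFB; plain Java, no intrinsics.
class PshufbModel {

    // Output byte i is 0 if ctrl[i] has its MSB set, else src[ctrl[i] & 0x0F].
    static byte[] pshufb(byte[] src, byte[] ctrl) {
        byte[] dst = new byte[16];
        for (int i = 0; i < 16; i++) {
            dst[i] = (ctrl[i] & 0x80) != 0 ? (byte) 0 : src[ctrl[i] & 0x0F];
        }
        return dst;
    }

    // Control bytes that place source dword k into output dword 2k
    // and zero every odd output dword (rearrange and blend in one step).
    static byte[] spreadControl() {
        byte[] ctrl = new byte[16];
        for (int i = 0; i < 16; i++) {
            int outDword = i / 4;   // which 32-bit output element this byte is in
            int byteInDword = i % 4;
            ctrl[i] = (outDword % 2 == 0)
                    ? (byte) ((outDword / 2) * 4 + byteInDword)
                    : (byte) 0x80;  // MSB set: zero this byte
        }
        return ctrl;
    }

    public static void main(String[] args) {
        byte[] src = new byte[16];
        for (int i = 0; i < 16; i++) {
            src[i] = (byte) (i + 1);
        }
        System.out.println(java.util.Arrays.toString(pshufb(src, spreadControl())));
    }
}
```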

(There were some explorations into simplifying complex vector 
transformations [1] [2], but there hasn't been much progress on that 
front yet.)

Best regards,
Vladimir Ivanov

[1] http://mail.openjdk.java.net/pipermail/panama-dev/2018-July/002374.html

[2] http://mail.openjdk.java.net/pipermail/panama-dev/2018-August/002440.html