Value array element size represented as log2

Tue Oct 2 17:00:54 UTC 2018

Are we sure this is needed? C++ packs arrays with alignment of 
sizeof(elem), and it seems to do fine ...

For performance, do we worry about random access to arrays, or counted 
loops?

For counted loops: if we have a program like this:

     static final __ByValue class V { // A Value
         public final int v1;
         public final int v2;
         public final int v3;
     }

     public static V[] arr_V = new V[10000];

     static int count() {
         int n = 0;
         for (V v : arr_V) {
             n += v.v1;
             n += v.v2;
             n += v.v3;
         }
         return n;
     }

C2 generates:

   L_5            112: movslq %r8d,%r10
                  115: shl    $0x4,%r10 ;*aaload {reexecute=0 rethrow=0 
return_oop=0 return_vt=0}
; - Array_idx::count at 18 (line 26)

                  119: add    0x10(%r11,%r10,1),%eax
                  124: add    0x14(%r11,%r10,1),%eax
                  129: add    0x18(%r11,%r10,1),%eax
                  134: add    0x20(%r11,%r10,1),%eax
                  139: add    0x24(%r11,%r10,1),%eax
                  144: add    0x28(%r11,%r10,1),%eax
                  149: add    0x30(%r11,%r10,1),%eax
                  154: add    0x34(%r11,%r10,1),%eax
                  159: add    0x38(%r11,%r10,1),%eax
                  164: add    0x40(%r11,%r10,1),%eax
                  169: add    0x44(%r11,%r10,1),%eax
                  174: add    0x48(%r11,%r10,1),%eax ;*iadd {reexecute=0 
rethrow=0 return_oop=0 return_vt=0}
; - Array_idx::count at 43 (line 29)

                  179: add    $0x4,%r8d ;*iinc {reexecute=0 rethrow=0 
return_oop=0 return_vt=0}
; - Array_idx::count at 45 (line 26)

                  183: cmp    %r9d,%r8d
                  186: jl      L_5 (112) ^ ;*if_icmpge {reexecute=0 
rethrow=0 return_oop=0 return_vt=0}
; - Array_idx::count at 13 (line 26)

If the array is packed with 12 bytes per element, we just need to change 
179 to

                  179: add (12 * 4), %r10
                       cmp  %r9, %r10                        ;; %r9 = 
(array.length * 12)
                       jl   119

It will be faster, and use one fewer register!

We need to pre-scale the loop limit (%r9d) to V.length * 12 (in %r9), 
but we have enough bits on 64-bit platforms.

- Ioi

On 10/2/18 2:24 AM, John Rose wrote:
> On Oct 2, 2018, at 12:55 AM, David Simms <david.simms at oracle.com> wrote:
>> What is it you are asking here ? That you want to 20 bytes rather than pow 2 aligned ?
> 20 bytes would be represented as 5*4, where the 5 comes from the 1/3/5/7 selector,
> and 4 is the power of two.
>
> All odd sizes up to 7, all even sizes up to 16, all multiples of four up to 32 bytes,
> and all multiples of eight up to 64 bytes are representable in this scheme.
>
> The first several byte sizes not exactly representable are:  9(10), 11(12), 13(14), 15(16),
> 17,18,19(20), 21,22,23(24), 25,26,27(28), 29,30,31,(32).  The parenthesized numbers
> are the next higher representable sizes.  As you can see the error is kept small.
>
> The layout helper would have to be tweaked to allow the 1/3/5/7 selector, and users
> of the layout helper which process value arrays would have to take it into account.
>
> A good encoding for the selector would be:
>
> enum sel1357 { sel1=0, sel3=1, sel5=2, sel7=3 };
>
> sel1357 encode_sel(int sel) {
>    sel1357 sel_code = (sel1357)((sel - 1) >> 1);
>    assert(sel == 1 + (sel_code << 1));
>    return sel_code;
> }
> int decode_sel(sel1357 sel_code) {
>    int sel = 1 + ((int)sel_code << 1);
> }
>
> The logic for choosing the best selector and shift could be:
>
> int decode_shift_and_sel(int size, sel1357& return_sel) {
>    int best_err = size, best_sel = 0, best_shift = 0;
>    for (int sel = 1; sel <= 7; sel += 2) {
>      int shift = upper_log2((size + sel - 1) / sel);
>      int err = (1 << shift) * sel;
>      if (err < best_err) { best_err = err; best_sel = sel; best_shift = shift; }
>    }
>    return_sel = encode_sel(best_sel);
>    return best_shift;
> }
>
> (There are surely bugs in this, but you get the idea.  A dead-reckoning
> algorithm could do the job faster, without the loop, but it would be harder
> to read.)
>
>> A long time ago we discussed that mul was too expensive compared to shift, so that encoding is intentional (addressing uses shift)
> No mul is needed, ever.  In most cases the JIT can see the array type and
> emit suitable shift/add sequences to produce 3 (1+2), 5 (1+4), or 7 (8-1).
>
> Even the interpreter, or polymorphic code, can produce suitable ALU
> operations that are competitive with mul, if only those four multipliers
> are possible.
>
> int mul_by_sel(intptr_t x, sel1357 sel) {
>    switch (sel) {
>    case sel1: return x;
>    case sel3: return x + (x << 1);
>    case sel5: return x + (x << 2);
>    case sel7: return (x << 3) - x;
>    }
> }
>
> Similar points can be made for a one-bit 1/3 selector, with more footprint
> and somewhat simpler decoding.
>
> int mul_by_sel(intptr_t x, sel13 sel) {
>    return x + (-(int)sel & (x << 1));
> }
>
> In many cases, the value types will have true power of two size, so that
> the selector will be "sel1" (unit multiplier).  If that's common enough, then
> a speculation might be in order:
>
> int mul_by_sel(intptr_t x, sel1357 sel) {
>    if (sel == sel1)  return x;
>    int result = (x << (int)sel);  // needs fixup for sel7
>    if (sel == sel7)  x = -x;  // fixup
>    result += x;
>    return result;
> }
>
> int mul_by_sel(intptr_t x, sel13 sel) {
>    return sel == sel1 ? x : x + (x << 1);
> }
>
>