Value array element size represented as log2
Ioi Lam
ioi.lam at oracle.com
Tue Oct 2 17:00:54 UTC 2018
Are we sure this is needed? C++ packs arrays with alignment of
sizeof(elem), and it seems to do fine ...
For performance, do we worry about random access to arrays, or counted
loops?
For counted loops: if we have a program like this:
static final __ByValue class V { // A Value
public final int v1;
public final int v2;
public final int v3;
}
public static V[] arr_V = new V[10000];
static int count() {
int n = 0;
for (V v : arr_V) {
n += v.v1;
n += v.v2;
n += v.v3;
}
return n;
}
C2 generates:
L_5 112: movslq %r8d,%r10
115: shl $0x4,%r10 ;*aaload {reexecute=0 rethrow=0
return_oop=0 return_vt=0}
; - Array_idx::count at 18 (line 26)
119: add 0x10(%r11,%r10,1),%eax
124: add 0x14(%r11,%r10,1),%eax
129: add 0x18(%r11,%r10,1),%eax
134: add 0x20(%r11,%r10,1),%eax
139: add 0x24(%r11,%r10,1),%eax
144: add 0x28(%r11,%r10,1),%eax
149: add 0x30(%r11,%r10,1),%eax
154: add 0x34(%r11,%r10,1),%eax
159: add 0x38(%r11,%r10,1),%eax
164: add 0x40(%r11,%r10,1),%eax
169: add 0x44(%r11,%r10,1),%eax
174: add 0x48(%r11,%r10,1),%eax ;*iadd {reexecute=0
rethrow=0 return_oop=0 return_vt=0}
; - Array_idx::count at 43 (line 29)
179: add $0x4,%r8d ;*iinc {reexecute=0 rethrow=0
return_oop=0 return_vt=0}
; - Array_idx::count at 45 (line 26)
183: cmp %r9d,%r8d
186: jl L_5 (112) ^ ;*if_icmpge {reexecute=0
rethrow=0 return_oop=0 return_vt=0}
; - Array_idx::count at 13 (line 26)
If the array is packed with 12 bytes per element, we just need to change
179 to
179: add (12 * 4), %r10
cmp %r9, %r10 ;; %r9 =
(array.length * 12)
jl 119
It will be faster, and use one fewer register!
We need to pre-scale the loop limit (%r9d) to V.length * 12 (in %r9),
but we have enough bits on 64-bit platforms.
- Ioi
On 10/2/18 2:24 AM, John Rose wrote:
> On Oct 2, 2018, at 12:55 AM, David Simms <david.simms at oracle.com> wrote:
>> What is it you are asking here ? That you want to 20 bytes rather than pow 2 aligned ?
> 20 bytes would be represented as 5*4, where the 5 comes from the 1/3/5/7 selector,
> and 4 is the power of two.
>
> All odd sizes up to 7, all even sizes up to 16, all multiples of four up to 32 bytes,
> and all multiples of eight up to 64 bytes are representable in this scheme.
>
> The first several byte sizes not exactly representable are: 9(10), 11(12), 13(14), 15(16),
> 17,18,19(20), 21,22,23(24), 25,26,27(28), 29,30,31,(32). The parenthesized numbers
> are the next higher representable sizes. As you can see the error is kept small.
>
> The layout helper would have to be tweaked to allow the 1/3/5/7 selector, and users
> of the layout helper which process value arrays would have to take it into account.
>
> A good encoding for the selector would be:
>
> enum sel1357 { sel1=0, sel3=1, sel5=2, sel7=3 };
>
> sel1357 encode_sel(int sel) {
> sel1357 sel_code = (sel1357)((sel - 1) >> 1);
> assert(sel == 1 + (sel_code << 1));
> return sel_code;
> }
> int decode_sel(sel1357 sel_code) {
> int sel = 1 + ((int)sel_code << 1);
> }
>
> The logic for choosing the best selector and shift could be:
>
> int decode_shift_and_sel(int size, sel1357& return_sel) {
> int best_err = size, best_sel = 0, best_shift = 0;
> for (int sel = 1; sel <= 7; sel += 2) {
> int shift = upper_log2((size + sel - 1) / sel);
> int err = (1 << shift) * sel;
> if (err < best_err) { best_err = err; best_sel = sel; best_shift = shift; }
> }
> return_sel = encode_sel(best_sel);
> return best_shift;
> }
>
> (There are surely bugs in this, but you get the idea. A dead-reckoning
> algorithm could do the job faster, without the loop, but it would be harder
> to read.)
>
>> A long time ago we discussed that mul was too expensive compared to shift, so that encoding is intentional (addressing uses shift)
> No mul is needed, ever. In most cases the JIT can see the array type and
> emit suitable shift/add sequences to produce 3 (1+2), 5 (1+4), or 7 (8-1).
>
> Even the interpreter, or polymorphic code, can produce suitable ALU
> operations that are competitive with mul, if only those four multipliers
> are possible.
>
> int mul_by_sel(intptr_t x, sel1357 sel) {
> switch (sel) {
> case sel1: return x;
> case sel3: return x + (x << 1);
> case sel5: return x + (x << 2);
> case sel7: return (x << 3) - x;
> }
> }
>
> Similar points can be made for a one-bit 1/3 selector, with more footprint
> and somewhat simpler decoding.
>
> int mul_by_sel(intptr_t x, sel13 sel) {
> return x + (-(int)sel & (x << 1));
> }
>
> In many cases, the value types will have true power of two size, so that
> the selector will be "sel1" (unit multiplier). If that's common enough, then
> a speculation might be in order:
>
> int mul_by_sel(intptr_t x, sel1357 sel) {
> if (sel == sel1) return x;
> int result = (x << (int)sel); // needs fixup for sel7
> if (sel == sel7) x = -x; // fixup
> result += x;
> return result;
> }
>
> int mul_by_sel(intptr_t x, sel13 sel) {
> return sel == sel1 ? x : x + (x << 1);
> }
>
>
More information about the valhalla-dev
mailing list