IntVector.fromValues is not optimized away ?

Mon May 11 20:35:51 UTC 2020

I tried several different snippets with more or less success and found several other things that should be fixed. 

Option 1 
var zero = (IntVector)IntVector.SPECIES_64.zero(); 
var v1 = zero.withLane(0, i1).withLane(1, i3); 
var v2 = zero.withLane(0, i2).withLane(1, i4); 
var result = v1.lanewise(VectorOperators.XOR, v2); 
return result.lane(0) ^ result.lane(1); 

I get: 
0x00007fb74c33693c: mov 0x14(%rsi),%r11d 
0x00007fb74c336940: mov 0x10(%rsi),%r10d 
0x00007fb74c336944: mov 0x18(%rsi),%r9d 
0x00007fb74c336948: mov 0xc(%rsi),%r8d 
0x00007fb74c33694c: movabs $0x71963e868,%rcx ; {oop([I{0x000000071963e868})} 
0x00007fb74c336956: vmovq 0x10(%rcx),%xmm0 
0x00007fb74c33695b: vmovdqu %xmm0,%xmm1 
0x00007fb74c33695f: vpinsrd $0x0,%r8d,%xmm1,%xmm1 
0x00007fb74c336965: vpinsrd $0x0,%r10d,%xmm0,%xmm0 
0x00007fb74c33696b: vpinsrd $0x1,%r11d,%xmm1,%xmm1 
0x00007fb74c336971: vpinsrd $0x1,%r9d,%xmm0,%xmm0 ;*invokestatic extract {reexecute=0 rethrow=0 return_oop=0} 
; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482) 
; - jdk.incubator.vector.Int64Vector::lane at 36 (line 476) 
; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 67 (line 19) 
0x00007fb74c336977: vpxor %xmm0,%xmm1,%xmm0 ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} 
; - jdk.incubator.vector.IntVector::lanewiseTemplate at 244 (line 652) 
; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 277) 
; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 41) 
; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 53 (line 18) 
0x00007fb74c33697b: vmovd %xmm0,%eax 
0x00007fb74c33697f: vpextrd $0x1,%xmm0,%r10d 
0x00007fb74c336985: xor %r10d,%eax 

This is not that bad, but at the same time it seems HotSpot is not able to see that the value loaded from 0x10(%rcx) is zero.
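
To make explicit what option 1 is supposed to compute, here is a scalar model of the lane operations in plain Java. It is only a sketch: the class and method names are hypothetical, int[] arrays stand in for the 64-bit vectors, and nothing from jdk.incubator.vector is used. It shows that the two-lane XOR followed by lane(0) ^ lane(1) gives the same value as i1 ^ i2 ^ i3 ^ i4.

```java
// Scalar model of option 1; all names are illustrative, not part of the Vector API.
final class Option1Model {
    // Models zero.withLane(0, a).withLane(1, b) on a 2-lane int vector.
    static int[] twoLanes(int a, int b) {
        int[] lanes = new int[2]; // starts out as the zero vector
        lanes[0] = a;
        lanes[1] = b;
        return lanes;
    }

    // Models v1.lanewise(VectorOperators.XOR, v2).
    static int[] xorLanes(int[] v1, int[] v2) {
        int[] result = new int[v1.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = v1[i] ^ v2[i];
        }
        return result;
    }

    // Same shape as the hashCode2 snippet of option 1.
    static int hashCode2(int i1, int i2, int i3, int i4) {
        int[] v1 = twoLanes(i1, i3);
        int[] v2 = twoLanes(i2, i4);
        int[] result = xorLanes(v1, v2);
        return result[0] ^ result[1];
    }

    public static void main(String[] args) {
        // Compare the model against the plain scalar hash.
        System.out.println(hashCode2(1, 2, 3, 4) == (1 ^ 2 ^ 3 ^ 4));
    }
}
```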

Option 2, 3 and 4: 
using one of 
var zero = IntVector.zero(SPECIES_64); 
var zero = (IntVector)IntVector.SPECIES_64.broadcast(0); 
var zero = IntVector.broadcast(SPECIES_64, 0); 
all generate code that calls into the HotSpot runtime. It's an intrinsic, but my hardware doesn't seem to have an instruction for it, so
a call to the HotSpot runtime is generated, which makes it very inefficient.
It's as if the fallback Java code of the intrinsic was not used (and not inlined).

So I get code like:
0x00007fa57ff04f7f: movabs $0x719645ea0,%rdi ; {oop(a 'jdk/incubator/vector/IntVector$$Lambda$64+0x0000000800b71550'{0x0000000719645ea0})} 
0x00007fa57ff04f89: movabs $0x719632b80,%rsi ; {oop(a 'java/lang/Class'{0x0000000719632b80} = 'jdk/incubator/vector/Int64Vector')} 
0x00007fa57ff04f93: movabs $0x7ffd002a0,%rdx ; {oop(a 'java/lang/Class'{0x00000007ffd002a0} = int)} 
0x00007fa57ff04f9d: mov $0x2,%ecx 
0x00007fa57ff04fa2: xor %r8d,%r8d 
0x00007fa57ff04fa5: movabs $0x7196327c8,%r9 ; {oop(a 'jdk/incubator/vector/IntVector$IntSpecies'{0x00000007196327c8})} 
0x00007fa57ff04faf: callq 0x00007fa57843f400 ; ImmutableOopMap {rbp=Oop } 
;*invokestatic broadcastCoerced {reexecute=0 rethrow=0 

Option 5: 
Use reinterpretShape:
var zero = (IntVector)IntVector.SPECIES_128.zero(); 
var v = zero.withLane(0, i1).withLane(1, i2).withLane(2, i3).withLane(3, i4); 
var v1 = (IntVector)v.reinterpretShape(IntVector.SPECIES_64, 0); <-- here 
var v2 = (IntVector)v.reinterpretShape(IntVector.SPECIES_64, 1); <-- and here 
var result = v1.lanewise(VectorOperators.XOR, v2); 
return result.lane(0) ^ result.lane(1); 

As with fromValues, an array is still created, along with a bunch of weird code around VectorSupport$VectorPayload::getPayload.
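
For reference, the intent of option 5 can be modelled with plain arrays. This is a sketch under an assumption: that reinterpretShape with part 0 selects the two lowest lanes of the 128-bit vector and part 1 the two highest. The names below are hypothetical and the code does not use jdk.incubator.vector.

```java
import java.util.Arrays;

// Scalar model of option 5; all names are illustrative, not part of the Vector API.
final class Option5Model {
    // Models v.reinterpretShape(IntVector.SPECIES_64, part) on a 4-lane int vector.
    // ASSUMPTION: part 0 selects lanes 0..1 and part 1 selects lanes 2..3.
    static int[] half(int[] v128, int part) {
        return Arrays.copyOfRange(v128, 2 * part, 2 * part + 2);
    }

    static int hashCode2(int i1, int i2, int i3, int i4) {
        int[] v = { i1, i2, i3, i4 };  // models the four withLane calls
        int[] v1 = half(v, 0);         // (i1, i2)
        int[] v2 = half(v, 1);         // (i3, i4)
        // Lanewise XOR of the two halves, then XOR of the two resulting lanes.
        return (v1[0] ^ v2[0]) ^ (v1[1] ^ v2[1]);
    }
}
```

Under that assumption the lanes pair up as (i1, i3) and (i2, i4), so the final value is again i1 ^ i2 ^ i3 ^ i4.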

Option 6: 
Do the reduction using reduceLanes:
var zero = (IntVector)IntVector.SPECIES_64.zero(); 
var v1 = zero.withLane(0, i1).withLane(1, i3); 
var v2 = zero.withLane(0, i2).withLane(1, i4); 
var result = v1.lanewise(VectorOperators.XOR, v2); 
return result.reduceLanes(VectorOperators.XOR); <-- here 

I get: 
0x00007f62db4e4f37: mov %rbp,0x10(%rsp) ;*synchronization entry 
; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at -1 (line 38) 
0x00007f62db4e4f3c: mov 0x14(%rsi),%r11d 
0x00007f62db4e4f40: mov 0x10(%rsi),%r10d 
0x00007f62db4e4f44: mov 0x18(%rsi),%r9d 
0x00007f62db4e4f48: mov 0xc(%rsi),%r8d 
0x00007f62db4e4f4c: movabs $0x71963e868,%rcx ; {oop([I{0x000000071963e868})} 
0x00007f62db4e4f56: vmovq 0x10(%rcx),%xmm0 
0x00007f62db4e4f5b: vmovdqu %xmm0,%xmm1 
0x00007f62db4e4f5f: vpinsrd $0x0,%r8d,%xmm1,%xmm1 
0x00007f62db4e4f65: vpinsrd $0x0,%r10d,%xmm0,%xmm0 
0x00007f62db4e4f6b: vpinsrd $0x1,%r11d,%xmm1,%xmm1 
0x00007f62db4e4f71: vpinsrd $0x1,%r9d,%xmm0,%xmm0 
0x00007f62db4e4f77: vpxor %xmm0,%xmm1,%xmm0 
0x00007f62db4e4f7b: xor %r11d,%r11d 
0x00007f62db4e4f7e: vpshufd $0x1,%xmm0,%xmm2 
0x00007f62db4e4f83: vpxor %xmm0,%xmm2,%xmm2 
0x00007f62db4e4f87: vmovd %r11d,%xmm1 
0x00007f62db4e4f8c: vpxor %xmm1,%xmm2,%xmm2 
0x00007f62db4e4f90: vmovd %xmm2,%eax 

Notice that after the first vpxor there are two (not one) additional vpxor instructions. If my assembler-fu is correct, one of them is a xor between the vector and 0, because the reduction is seeded with the neutral element 0 instead of being done only between the values inside the AVX register.
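
The redundancy can be illustrated with a scalar model of the two reduction strategies. This is only a sketch with hypothetical names; it does not reproduce how reduceLanes is actually implemented, it just shows that seeding the XOR reduction with 0 and seeding it with lane 0 give the same result, so the extra xor with 0 does no useful work.

```java
// Scalar model of an XOR reduction over lanes; names are illustrative only.
final class XorReduceModel {
    // Reduction seeded with the neutral element 0, one more xor than strictly needed.
    static int reduceWithIdentity(int[] lanes) {
        int acc = 0;
        for (int lane : lanes) {
            acc ^= lane;
        }
        return acc;
    }

    // Reduction seeded with lane 0, xoring only the remaining lanes.
    static int reduceWithoutIdentity(int[] lanes) {
        int acc = lanes[0];
        for (int i = 1; i < lanes.length; i++) {
            acc ^= lanes[i];
        }
        return acc;
    }
}
```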

If instead of using reduceLanes I write the loop myself, I get the code I would expect from reduceLanes:
var zero = (IntVector)IntVector.SPECIES_64.zero(); 
var v1 = zero.withLane(0, i1).withLane(1, i3); 
var v2 = zero.withLane(0, i2).withLane(1, i4); 
var result = v1.lanewise(VectorOperators.XOR, v2); 
var acc = result.lane(0); 
for(var i = 1; i < IntVector.SPECIES_64.length(); i++) { <-- loop instead of reduceLanes 
acc = acc ^ result.lane(i); 
} 
return acc; 

I get: 
0x00007fa378331db7: mov %rbp,0x10(%rsp) ;*synchronization entry 
; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at -1 (line 26) 
0x00007fa378331dbc: mov 0x14(%rsi),%r11d 
0x00007fa378331dc0: mov 0x10(%rsi),%r10d 
0x00007fa378331dc4: mov 0x18(%rsi),%r9d 
0x00007fa378331dc8: mov 0xc(%rsi),%r8d 
0x00007fa378331dcc: movabs $0x71963e868,%rcx ; {oop([I{0x000000071963e868})} 
0x00007fa378331dd6: vmovq 0x10(%rcx),%xmm0 
0x00007fa378331ddb: vmovdqu %xmm0,%xmm1 
0x00007fa378331ddf: vpinsrd $0x0,%r8d,%xmm1,%xmm1 
0x00007fa378331de5: vpinsrd $0x0,%r10d,%xmm0,%xmm0 
0x00007fa378331deb: vpinsrd $0x1,%r11d,%xmm1,%xmm1 
0x00007fa378331df1: vpinsrd $0x1,%r9d,%xmm0,%xmm0 ;*invokestatic extract {reexecute=0 rethrow=0 return_oop=0} 
; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482) 
; - jdk.incubator.vector.Int64Vector::lane at 30 (line 475) 
; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 88 (line 32) 
0x00007fa378331df7: vpxor %xmm0,%xmm1,%xmm0 ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0} 
; - jdk.incubator.vector.IntVector::lanewiseTemplate at 244 (line 652) 
; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 277) 
; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 41) 
; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 53 (line 29) 
0x00007fa378331dfb: vpextrd $0x1,%xmm0,%eax 
0x00007fa378331e01: vmovd %xmm0,%r10d 
0x00007fa378331e06: xor %r10d,%eax 

So for reduceLanes I think it would be better to
- loop using xor instead of vpxor
- not use the neutral element 0

regards, 
Rémi 

> From: "Paul Sandoz" <paul.sandoz at oracle.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Sent: Monday, May 11, 2020 20:02:14
> Subject: Re: IntVector.fromValues is not optimized away ?

> Hi Remi,

> For some reason this method does not defer to the fromArray equivalent.

> Can you try with the following patch?

> http://cr.openjdk.java.net/~psandoz/panama/vector-from-values-using-from-array/webrev/

> I shall also investigate further.

> Paul.

>> On May 9, 2020, at 11:52 AM, Remi Forax <forax at univ-mlv.fr> wrote:

>> Hi all,
>> this may be obvious, but do we agree that IntVector.fromValues is not optimized
>> and thus really creates an array, destroying any hope of performance?

>> I'm trying to see the difference between

>> public int hashCode() {
>> return i1 ^ i2 ^ i3 ^ i4;
>> }

>> and

>> public int hashCode() {
>> var v1 = IntVector.fromValues(IntVector.SPECIES_64, i1, i3);
>> var v2 = IntVector.fromValues(IntVector.SPECIES_64, i2, i4);
>> var result = v1.lanewise(VectorOperators.XOR, v2);
>> return result.lane(0) ^ result.lane(1);
>> }

>> but looking at the generated assembly (below), the allocations of the two
>> arrays are still there.
>> Too bad, because the last 6 instructions are more or less what I was expecting.

>> 0x00007fbb383324dc: mov 0x14(%rsi),%r11d ;*getfield i3 {reexecute=0 rethrow=0
>> return_oop=0}
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 16 (line 14)
>> 0x00007fbb383324e0: mov 0xc(%rsi),%ebp
>> 0x00007fbb383324e3: mov 0x120(%r15),%r8
>> 0x00007fbb383324ea: mov %r8,%r10
>> 0x00007fbb383324ed: add $0x18,%r10
>> 0x00007fbb383324f1: cmp 0x130(%r15),%r10
>> 0x00007fbb383324f8: jae 0x00007fbb383325db
>> 0x00007fbb383324fe: mov %r10,0x120(%r15)
>> 0x00007fbb38332505: prefetchw 0xc0(%r10)
>> 0x00007fbb3833250d: movq $0x1,(%r8)
>> 0x00007fbb38332514: prefetchw 0x100(%r10)
>> 0x00007fbb3833251c: movl $0x70cb1,0x8(%r8) ; {metadata({type array int})}
>> 0x00007fbb38332524: prefetchw 0x140(%r10)
>> 0x00007fbb3833252c: movl $0x2,0xc(%r8)
>> 0x00007fbb38332534: prefetchw 0x180(%r10)
>> 0x00007fbb3833253c: mov %ebp,0x10(%r8)
>> 0x00007fbb38332540: mov %r11d,0x14(%r8) ;*newarray {reexecute=0 rethrow=0
>> return_oop=0}
>> ; - java.util.Arrays::copyOf at 1 (line 3584)
>> ; - jdk.incubator.vector.IntVector::fromValues at 19 (line 553)
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 20 (line 14)
>> 0x00007fbb38332544: mov 0x18(%rsi),%r9d
>> 0x00007fbb38332548: mov 0x120(%r15),%rax ;*invokestatic extract {reexecute=0
>> rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482)
>> ; - jdk.incubator.vector.Int64Vector::lane at 36 (line 476)
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 64 (line 17)
>> 0x00007fbb3833254f: mov 0x10(%rsi),%ebp ;*getfield i2 {reexecute=0 rethrow=0
>> return_oop=0}
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 33 (line 15)
>> 0x00007fbb38332552: mov %rax,%r10
>> 0x00007fbb38332555: add $0x18,%r10
>> 0x00007fbb38332559: nopl 0x0(%rax)
>> 0x00007fbb38332560: cmp 0x130(%r15),%r10
>> 0x00007fbb38332567: jae 0x00007fbb3833260d
>> 0x00007fbb3833256d: mov %r10,0x120(%r15)
>> 0x00007fbb38332574: prefetchw 0xc0(%r10)
>> 0x00007fbb3833257c: movq $0x1,(%rax)
>> 0x00007fbb38332583: prefetchw 0x100(%r10)
>> 0x00007fbb3833258b: movl $0x70cb1,0x8(%rax) ; {metadata({type array int})}
>> 0x00007fbb38332592: prefetchw 0x140(%r10)
>> 0x00007fbb3833259a: movl $0x2,0xc(%rax)
>> 0x00007fbb383325a1: prefetchw 0x180(%r10)
>> 0x00007fbb383325a9: mov %ebp,0x10(%rax)
>> 0x00007fbb383325ac: mov %r9d,0x14(%rax) ;*newarray {reexecute=0 rethrow=0
>> return_oop=0}
>> ; - java.util.Arrays::copyOf at 1 (line 3584)
>> ; - jdk.incubator.vector.IntVector::fromValues at 19 (line 553)
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 44 (line 15)
>> 0x00007fbb383325b0: vmovq 0x10(%rax),%xmm0 ;*invokestatic extract {reexecute=0
>> rethrow=0 return_oop=0}
>> ; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482)
>> ; - jdk.incubator.vector.Int64Vector::lane at 36 (line 476)
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 64 (line 17)
>> 0x00007fbb383325b5: vpxor 0x10(%r8),%xmm0,%xmm0 ;*invokespecial <init>
>> {reexecute=0 rethrow=0 return_oop=0}
>> ; - jdk.internal.vm.vector.VectorSupport$Vector::<init>@2 (line 104)
>> ; - jdk.incubator.vector.Vector::<init>@2 (line 1122)
>> ; - jdk.incubator.vector.AbstractVector::<init>@2 (line 67)
>> ; - jdk.incubator.vector.IntVector::<init>@2 (line 55)
>> ; - jdk.incubator.vector.Int64Vector::<init>@2 (line 58)
>> ; - jdk.incubator.vector.Int64Vector::vectorFactory at 5 (line 169)
>> ; - jdk.incubator.vector.Int64Vector::vectorFactory at 2 (line 41)
>> ; - jdk.incubator.vector.IntVector$IntSpecies::vectorFactory at 5 (line 3718)
>> ; - jdk.incubator.vector.IntVector::fromValues at 22 (line 553)
>> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 44 (line 15)
>> 0x00007fbb383325bb: vpextrd $0x1,%xmm0,%r11d
>> 0x00007fbb383325c1: vmovd %xmm0,%eax
>> 0x00007fbb383325c5: xor %r11d,%eax
>> 0x00007fbb383325c8: vzeroupper

>> regards,
>> Rémi