IntVector.fromValues is not optimized away ?
Paul Sandoz
paul.sandoz at oracle.com
Tue May 12 00:14:25 UTC 2020
> On May 11, 2020, at 5:01 PM, Viswanathan, Sandhya <sandhya.viswanathan at intel.com> wrote:
>
> Hi Paul,
>
> The reduction rules are shared between auto-vectorizer and vector api.
> For the auto vectorizer, a variable scalar input comes to the codegen.
Ah, that makes sense.
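
To make the distinction concrete, here is a minimal sketch in plain Java (not the Vector API and not HotSpot code; the class and method names are hypothetical). It shows the loop shape the auto-vectorizer sees, where the accumulator starts from a runtime value, so the final combine with the scalar input is required in general:

```java
public class ReductionSeedSketch {
    // Auto-vectorizer shape: the accumulator starts from a value that is
    // only known at run time, so the last combine with it must be kept.
    static int xorAll(int seed, int[] values) {
        int acc = seed;
        for (int v : values) {
            acc ^= v;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] values = {1, 2, 4, 8};
        // Variable seed: the seed contributes to the result.
        System.out.println(xorAll(0x10, values));
        // Identity seed (0 for XOR), as in the Vector API case described
        // in this thread: the seed contributes nothing to the result.
        System.out.println(xorAll(0, values));
    }
}
```

Only in the second call is the seed a known identity, which is the case where the last step is redundant.
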
> For the vector api, this scalar input is passed as the corresponding identity value for that operation.
>
> The last step in the reduction could easily be removed by duplicating the reduction rules in the .ad file with an immediate identity value as input.
> As that increases the .ad file size, we left this optimization out for the majority of the operations.
> I can create an RFE bug entry for this and we can do this post merge based on feedback from Vladimir (Ivanov and Kozlov) and others.
> Only for floating point min/max reduction, we kept these special rules in code gen as the operation is very heavy.
I see, logging an RFE would be useful to capture this. Any duplication would be a shame given the increase that the Vector API brings to the ad files. I wonder if it's possible to teach the shared reduction code about operations using the identity value? Given it's a constant in the Vector API case, the corresponding op might be elided, and this might also benefit the auto-vectorizer for detected constant initial values.
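
As a rough illustration of why eliding that op would be safe, here is a scalar model in plain Java (an assumption-laden sketch, not the C2 reduction code; all names are hypothetical). When the seed is the operation's identity, a seeded reduction and a lanes-only reduction give the same result:

```java
import java.util.function.IntBinaryOperator;

public class IdentityElisionSketch {
    // Reduction seeded with a scalar input, modelling the shared rules
    // that always combine the lanes with that input.
    static int reduceSeeded(int seed, IntBinaryOperator op, int[] lanes) {
        int acc = seed;
        for (int lane : lanes) {
            acc = op.applyAsInt(acc, lane);
        }
        return acc;
    }

    // Reduction over the lanes only (needs at least one lane); this
    // models the code with the final combine elided.
    static int reduceLanesOnly(IntBinaryOperator op, int[] lanes) {
        int acc = lanes[0];
        for (int i = 1; i < lanes.length; i++) {
            acc = op.applyAsInt(acc, lanes[i]);
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = {0x12345678, 0x0F0F0F0F};
        // Identity values: 0 for XOR and ADD, -1 (all bits set) for AND.
        System.out.println(reduceSeeded(0, (a, b) -> a ^ b, lanes)
                == reduceLanesOnly((a, b) -> a ^ b, lanes));
        System.out.println(reduceSeeded(-1, (a, b) -> a & b, lanes)
                == reduceLanesOnly((a, b) -> a & b, lanes));
        System.out.println(reduceSeeded(0, (a, b) -> a + b, lanes)
                == reduceLanesOnly((a, b) -> a + b, lanes));
    }
}
```

The equivalence only holds when the seed is known to be the identity, so a compiler would have to prove the input is that constant before dropping the op.
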
Paul.
>
> Best Regards,
> Sandhya
>
> From: Paul Sandoz <paul.sandoz at oracle.com>
> Sent: Monday, May 11, 2020 2:42 PM
> To: Remi Forax <forax at univ-mlv.fr>
> Cc: panama-dev at openjdk.java.net; Viswanathan, Sandhya <sandhya.viswanathan at intel.com>
> Subject: Re: IntVector.fromValues is not optimized away ?
>
> Thanks, very interesting. Sandhya and her colleagues are better qualified than I to comment accurately as to current behavior.
>
> It does give me hope there is a better way out for fromValues, and in fact we could do that in the template code, at least for fixed sizes. But the larger the vector lane count gets, the less useful it may become.
>
> FWIW I can reproduce the reduction issue:
>
> var v1 = IntVector.fromArray(IntVector.SPECIES_64, ia, 0);
> return v1.reduceLanes(VectorOperators.XOR);
>
> 0.07% │ 0x000000010b929d8f: vmovq 0x10(%r12,%r10,8),%xmm0
> 0.24% │ 0x000000010b929d96: xor %r11d,%r11d
> 7.16% │ 0x000000010b929d99: vpshufd $0x1,%xmm0,%xmm2
> 0.51% │ 0x000000010b929d9e: vpxor %xmm0,%xmm2,%xmm2
> 0.39% │ 0x000000010b929da2: vmovd %r11d,%xmm1
> 0.16% │ 0x000000010b929da7: vpxor %xmm1,%xmm2,%xmm2
> 7.64% │ 0x000000010b929dab: vmovd %xmm2,%edx
>
> I think it's a bug in the code gen, which unnecessarily applies the identity value in the last stage of the reduction (I observe the same for the & and + operations).
>
> Paul.
>
>
> On May 11, 2020, at 1:35 PM, forax at univ-mlv.fr wrote:
>
> I tried several different snippets with more or less success and found several other things that should be fixed.
>
> Option 1
> var zero = (IntVector)IntVector.SPECIES_64.zero();
> var v1 = zero.withLane(0, i1).withLane(1, i3);
> var v2 = zero.withLane(0, i2).withLane(1, i4);
> var result = v1.lanewise(VectorOperators.XOR, v2);
> return result.lane(0) ^ result.lane(1);
>
> I get:
> 0x00007fb74c33693c: mov 0x14(%rsi),%r11d
> 0x00007fb74c336940: mov 0x10(%rsi),%r10d
> 0x00007fb74c336944: mov 0x18(%rsi),%r9d
> 0x00007fb74c336948: mov 0xc(%rsi),%r8d
> 0x00007fb74c33694c: movabs $0x71963e868,%rcx ; {oop([I{0x000000071963e868})}
> 0x00007fb74c336956: vmovq 0x10(%rcx),%xmm0
> 0x00007fb74c33695b: vmovdqu %xmm0,%xmm1
> 0x00007fb74c33695f: vpinsrd $0x0,%r8d,%xmm1,%xmm1
> 0x00007fb74c336965: vpinsrd $0x0,%r10d,%xmm0,%xmm0
> 0x00007fb74c33696b: vpinsrd $0x1,%r11d,%xmm1,%xmm1
> 0x00007fb74c336971: vpinsrd $0x1,%r9d,%xmm0,%xmm0 ;*invokestatic extract {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482)
> ; - jdk.incubator.vector.Int64Vector::lane at 36 (line 476)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 67 (line 19)
> 0x00007fb74c336977: vpxor %xmm0,%xmm1,%xmm0 ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 244 (line 652)
> ; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 277)
> ; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 41)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 53 (line 18)
> 0x00007fb74c33697b: vmovd %xmm0,%eax
> 0x00007fb74c33697f: vpextrd $0x1,%xmm0,%r10d
> 0x00007fb74c336985: xor %r10d,%eax
>
> which is not that bad, but at the same time, it seems HotSpot is not able to see that 0x10(%rcx) is zero ?
>
> Option 2, 3 and 4:
> using one of
> var zero = IntVector.zero(SPECIES_64);
> var zero = (IntVector)IntVector.SPECIES_64.broadcast(0);
> var zero = IntVector.broadcast(SPECIES_64, 0);
> all generate code that calls into the HotSpot runtime. It's an intrinsic, but my hardware doesn't seem to have an instruction for it, so
> a call to the HS runtime is generated, which makes it super inefficient.
> It's as if the fallback Java code of the intrinsic was not used (and not inlined).
>
> So I get code like:
> 0x00007fa57ff04f7f: movabs $0x719645ea0,%rdi ; {oop(a 'jdk/incubator/vector/IntVector$$Lambda$64+0x0000000800b71550'{0x0000000719645ea0})}
> 0x00007fa57ff04f89: movabs $0x719632b80,%rsi ; {oop(a 'java/lang/Class'{0x0000000719632b80} = 'jdk/incubator/vector/Int64Vector')}
> 0x00007fa57ff04f93: movabs $0x7ffd002a0,%rdx ; {oop(a 'java/lang/Class'{0x00000007ffd002a0} = int)}
> 0x00007fa57ff04f9d: mov $0x2,%ecx
> 0x00007fa57ff04fa2: xor %r8d,%r8d
> 0x00007fa57ff04fa5: movabs $0x7196327c8,%r9 ; {oop(a 'jdk/incubator/vector/IntVector$IntSpecies'{0x00000007196327c8})}
> 0x00007fa57ff04faf: callq 0x00007fa57843f400 ; ImmutableOopMap {rbp=Oop }
> ;*invokestatic broadcastCoerced {reexecute=0 rethrow=0
>
> Option 5:
> Use re-interpret shape
> var zero = (IntVector)IntVector.SPECIES_128.zero();
> var v = zero.withLane(0, i1).withLane(1, i2).withLane(2, i3).withLane(3, i4);
> var v1 = (IntVector)v.reinterpretShape(IntVector.SPECIES_64, 0); <-- here
> var v2 = (IntVector)v.reinterpretShape(IntVector.SPECIES_64, 1); <-- and here
> var result = v1.lanewise(VectorOperators.XOR, v2);
> return result.lane(0) ^ result.lane(1);
>
> As with fromValues, an array is still created, along with a bunch of weird code around VectorSupport$VectorPayload::getPayload.
>
> Option 6:
> do the reduce using reduceLanes,
> var zero = (IntVector)IntVector.SPECIES_64.zero();
> var v1 = zero.withLane(0, i1).withLane(1, i3);
> var v2 = zero.withLane(0, i2).withLane(1, i4);
> var result = v1.lanewise(VectorOperators.XOR, v2);
> return result.reduceLanes(VectorOperators.XOR); <-- here
>
> I get:
> 0x00007f62db4e4f37: mov %rbp,0x10(%rsp) ;*synchronization entry
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at -1 (line 38)
> 0x00007f62db4e4f3c: mov 0x14(%rsi),%r11d
> 0x00007f62db4e4f40: mov 0x10(%rsi),%r10d
> 0x00007f62db4e4f44: mov 0x18(%rsi),%r9d
> 0x00007f62db4e4f48: mov 0xc(%rsi),%r8d
> 0x00007f62db4e4f4c: movabs $0x71963e868,%rcx ; {oop([I{0x000000071963e868})}
> 0x00007f62db4e4f56: vmovq 0x10(%rcx),%xmm0
> 0x00007f62db4e4f5b: vmovdqu %xmm0,%xmm1
> 0x00007f62db4e4f5f: vpinsrd $0x0,%r8d,%xmm1,%xmm1
> 0x00007f62db4e4f65: vpinsrd $0x0,%r10d,%xmm0,%xmm0
> 0x00007f62db4e4f6b: vpinsrd $0x1,%r11d,%xmm1,%xmm1
> 0x00007f62db4e4f71: vpinsrd $0x1,%r9d,%xmm0,%xmm0
> 0x00007f62db4e4f77: vpxor %xmm0,%xmm1,%xmm0
> 0x00007f62db4e4f7b: xor %r11d,%r11d
> 0x00007f62db4e4f7e: vpshufd $0x1,%xmm0,%xmm2
> 0x00007f62db4e4f83: vpxor %xmm0,%xmm2,%xmm2
> 0x00007f62db4e4f87: vmovd %r11d,%xmm1
> 0x00007f62db4e4f8c: vpxor %xmm1,%xmm2,%xmm2
> 0x00007f62db4e4f90: vmovd %xmm2,%eax
>
> You can notice that after the first vpxor there are two (not one) other vpxor instructions. If my assembler fu is correct, the extra one is a xor between the vector and 0, because the reduction is done using the neutral element 0 instead of only between the values inside the AVX register.
>
> If, instead of using reduceLanes, I do the loop myself, I get the right code for the reduction:
> var zero = (IntVector)IntVector.SPECIES_64.zero();
> var v1 = zero.withLane(0, i1).withLane(1, i3);
> var v2 = zero.withLane(0, i2).withLane(1, i4);
> var result = v1.lanewise(VectorOperators.XOR, v2);
> var acc = result.lane(0);
> for(var i = 1; i < IntVector.SPECIES_64.length(); i++) { <-- loop instead of reduceLanes
> acc = acc ^ result.lane(i);
> }
> return acc;
>
> I get:
> 0x00007fa378331db7: mov %rbp,0x10(%rsp) ;*synchronization entry
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at -1 (line 26)
> 0x00007fa378331dbc: mov 0x14(%rsi),%r11d
> 0x00007fa378331dc0: mov 0x10(%rsi),%r10d
> 0x00007fa378331dc4: mov 0x18(%rsi),%r9d
> 0x00007fa378331dc8: mov 0xc(%rsi),%r8d
> 0x00007fa378331dcc: movabs $0x71963e868,%rcx ; {oop([I{0x000000071963e868})}
> 0x00007fa378331dd6: vmovq 0x10(%rcx),%xmm0
> 0x00007fa378331ddb: vmovdqu %xmm0,%xmm1
> 0x00007fa378331ddf: vpinsrd $0x0,%r8d,%xmm1,%xmm1
> 0x00007fa378331de5: vpinsrd $0x0,%r10d,%xmm0,%xmm0
> 0x00007fa378331deb: vpinsrd $0x1,%r11d,%xmm1,%xmm1
> 0x00007fa378331df1: vpinsrd $0x1,%r9d,%xmm0,%xmm0 ;*invokestatic extract {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482)
> ; - jdk.incubator.vector.Int64Vector::lane at 30 (line 475)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 88 (line 32)
> 0x00007fa378331df7: vpxor %xmm0,%xmm1,%xmm0 ;*invokestatic binaryOp {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.IntVector::lanewiseTemplate at 244 (line 652)
> ; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 277)
> ; - jdk.incubator.vector.Int64Vector::lanewise at 3 (line 41)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 53 (line 29)
> 0x00007fa378331dfb: vpextrd $0x1,%xmm0,%eax
> 0x00007fa378331e01: vmovd %xmm0,%r10d
> 0x00007fa378331e06: xor %r10d,%eax
>
> so for reduceLanes I think it's better to
> - loop using xor instead of vpxor
> - not use the neutral element 0
>
> regards,
> Rémi
>
>
> From: "Paul Sandoz" <paul.sandoz at oracle.com>
> To: "Remi Forax" <forax at univ-mlv.fr>
> Cc: "panama-dev at openjdk.java.net" <panama-dev at openjdk.java.net>
> Sent: Monday, May 11, 2020 20:02:14
> Subject: Re: IntVector.fromValues is not optimized away ?
> Hi Remi,
>
> For some reason this method does not defer to the fromArray equivalent.
>
> Can you try with the following patch?
>
> http://cr.openjdk.java.net/~psandoz/panama/vector-from-values-using-from-array/webrev/
>
> I shall also investigate further.
>
> Paul.
>
> On May 9, 2020, at 11:52 AM, Remi Forax <forax at univ-mlv.fr> wrote:
>
> Hi all,
> this may be obvious, but do we agree that IntVector.fromValues is not optimized and thus really creates an array, destroying any hope of perf ?
>
> I'm trying to see the difference between
>
> public int hashCode() {
> return i1 ^ i2 ^ i3 ^ i4;
> }
>
> and
>
> public int hashCode() {
> var v1 = IntVector.fromValues(IntVector.SPECIES_64, i1, i3);
> var v2 = IntVector.fromValues(IntVector.SPECIES_64, i2, i4);
> var result = v1.lanewise(VectorOperators.XOR, v2);
> return result.lane(0) ^ result.lane(1);
> }
>
> but taking a look at the generated assembly (below), the allocations of the two arrays are still there,
> too bad because the last 6 instructions are more or less what I was expecting.
>
>
> 0x00007fbb383324dc: mov 0x14(%rsi),%r11d ;*getfield i3 {reexecute=0 rethrow=0 return_oop=0}
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 16 (line 14)
> 0x00007fbb383324e0: mov 0xc(%rsi),%ebp
> 0x00007fbb383324e3: mov 0x120(%r15),%r8
> 0x00007fbb383324ea: mov %r8,%r10
> 0x00007fbb383324ed: add $0x18,%r10
> 0x00007fbb383324f1: cmp 0x130(%r15),%r10
> 0x00007fbb383324f8: jae 0x00007fbb383325db
> 0x00007fbb383324fe: mov %r10,0x120(%r15)
> 0x00007fbb38332505: prefetchw 0xc0(%r10)
> 0x00007fbb3833250d: movq $0x1,(%r8)
> 0x00007fbb38332514: prefetchw 0x100(%r10)
> 0x00007fbb3833251c: movl $0x70cb1,0x8(%r8) ; {metadata({type array int})}
> 0x00007fbb38332524: prefetchw 0x140(%r10)
> 0x00007fbb3833252c: movl $0x2,0xc(%r8)
> 0x00007fbb38332534: prefetchw 0x180(%r10)
> 0x00007fbb3833253c: mov %ebp,0x10(%r8)
> 0x00007fbb38332540: mov %r11d,0x14(%r8) ;*newarray {reexecute=0 rethrow=0 return_oop=0}
> ; - java.util.Arrays::copyOf at 1 (line 3584)
> ; - jdk.incubator.vector.IntVector::fromValues at 19 (line 553)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 20 (line 14)
> 0x00007fbb38332544: mov 0x18(%rsi),%r9d
> 0x00007fbb38332548: mov 0x120(%r15),%rax ;*invokestatic extract {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482)
> ; - jdk.incubator.vector.Int64Vector::lane at 36 (line 476)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 64 (line 17)
> 0x00007fbb3833254f: mov 0x10(%rsi),%ebp ;*getfield i2 {reexecute=0 rethrow=0 return_oop=0}
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 33 (line 15)
> 0x00007fbb38332552: mov %rax,%r10
> 0x00007fbb38332555: add $0x18,%r10
> 0x00007fbb38332559: nopl 0x0(%rax)
> 0x00007fbb38332560: cmp 0x130(%r15),%r10
> 0x00007fbb38332567: jae 0x00007fbb3833260d
> 0x00007fbb3833256d: mov %r10,0x120(%r15)
> 0x00007fbb38332574: prefetchw 0xc0(%r10)
> 0x00007fbb3833257c: movq $0x1,(%rax)
> 0x00007fbb38332583: prefetchw 0x100(%r10)
> 0x00007fbb3833258b: movl $0x70cb1,0x8(%rax) ; {metadata({type array int})}
> 0x00007fbb38332592: prefetchw 0x140(%r10)
> 0x00007fbb3833259a: movl $0x2,0xc(%rax)
> 0x00007fbb383325a1: prefetchw 0x180(%r10)
> 0x00007fbb383325a9: mov %ebp,0x10(%rax)
> 0x00007fbb383325ac: mov %r9d,0x14(%rax) ;*newarray {reexecute=0 rethrow=0 return_oop=0}
> ; - java.util.Arrays::copyOf at 1 (line 3584)
> ; - jdk.incubator.vector.IntVector::fromValues at 19 (line 553)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 44 (line 15)
> 0x00007fbb383325b0: vmovq 0x10(%rax),%xmm0 ;*invokestatic extract {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.incubator.vector.Int64Vector::laneHelper at 16 (line 482)
> ; - jdk.incubator.vector.Int64Vector::lane at 36 (line 476)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 64 (line 17)
> 0x00007fbb383325b5: vpxor 0x10(%r8),%xmm0,%xmm0 ;*invokespecial <init> {reexecute=0 rethrow=0 return_oop=0}
> ; - jdk.internal.vm.vector.VectorSupport$Vector::<init>@2 (line 104)
> ; - jdk.incubator.vector.Vector::<init>@2 (line 1122)
> ; - jdk.incubator.vector.AbstractVector::<init>@2 (line 67)
> ; - jdk.incubator.vector.IntVector::<init>@2 (line 55)
> ; - jdk.incubator.vector.Int64Vector::<init>@2 (line 58)
> ; - jdk.incubator.vector.Int64Vector::vectorFactory at 5 (line 169)
> ; - jdk.incubator.vector.Int64Vector::vectorFactory at 2 (line 41)
> ; - jdk.incubator.vector.IntVector$IntSpecies::vectorFactory at 5 (line 3718)
> ; - jdk.incubator.vector.IntVector::fromValues at 22 (line 553)
> ; - fr.umlv.vector.VectorizedHashCode$Data::hashCode2 at 44 (line 15)
> 0x00007fbb383325bb: vpextrd $0x1,%xmm0,%r11d
> 0x00007fbb383325c1: vmovd %xmm0,%eax
> 0x00007fbb383325c5: xor %r11d,%eax
> 0x00007fbb383325c8: vzeroupper
>
> regards,
> Rémi
>
>
>