[foreign-memaccess+abi] RFR: Split foreign vector load and store by null or not null base [v2]
Paul Sandoz
psandoz at openjdk.org
Wed Aug 31 21:42:30 UTC 2022
On Mon, 29 Aug 2022 20:53:43 GMT, Radoslaw Smogura <duke at openjdk.org> wrote:
>> Radoslaw Smogura has updated the pull request incrementally with one additional commit since the last revision:
>>
>> Add unswitching to masked vector operations
>> Add benchmark covering this.
>>
>> After
>> ```
>> Benchmark                                          (size)  Mode  Cnt    Score     Error  Units
>> MemorySegmentMaskedVectorAccess.arrayCopy            1024  avgt   10   16.700 ±   0.612  ns/op
>> MemorySegmentMaskedVectorAccess.directSegments       1024  avgt   10   80.429 ±   2.897  ns/op
>> MemorySegmentMaskedVectorAccess.heapSegments         1024  avgt   10   25.528 ±   0.296  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments2    1024  avgt   10  122.809 ±   0.894  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments3    1024  avgt   10  252.930 ±   4.623  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments4    1024  avgt   10  451.579 ±   6.429  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments5    1024  avgt   10  446.500 ±  39.156  ns/op
>> ```
>>
>> Before
>> ```
>> Benchmark                                          (size)  Mode  Cnt    Score     Error  Units
>> MemorySegmentMaskedVectorAccess.arrayCopy            1024  avgt   10   21.089 ±   0.219  ns/op
>> MemorySegmentMaskedVectorAccess.directSegments       1024  avgt   10   81.384 ±   1.008  ns/op
>> MemorySegmentMaskedVectorAccess.heapSegments         1024  avgt   10   25.626 ±   0.522  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments2    1024  avgt   10  217.733 ±   5.467  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments3    1024  avgt   10  441.045 ±   9.749  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments4    1024  avgt   10  522.613 ± 104.997  ns/op
>> MemorySegmentMaskedVectorAccess.pollutedSegments5    1024  avgt   10  449.814 ±   8.203  ns/op
>> ```
>
> I think I need help: I've found that a large number of deoptimizations happen when I execute the following code (with both the Java split and the VM split):
>
> public static int test3(MemorySegment in, MemorySegment out, MemorySegment out2, byte[] arr) {
>     long sz = in.byteSize();
>     var zero = ByteVector.zero(SPECIES_BYTE);
>     for (long i = 0; i < SPECIES_BYTE.loopBound(in.byteSize()); i += SPECIES_BYTE.vectorByteSize()) {
>         var v1 = ByteVector.fromMemorySegment(SPECIES_BYTE, in, i, ByteOrder.nativeOrder());
>         // arr[i] = (byte) 0;
>         v1.intoMemorySegment(out, i, ByteOrder.nativeOrder());
>     }
>
>     return 0;
> }
>
> public static void main(String[] args) throws Exception {
>     var session = MemorySession.openConfined();
>     MemorySegment heapIn = MemorySegment.ofArray(new byte[size]);
>     MemorySegment heapOu = MemorySegment.ofArray(new byte[size]);
>
>     MemorySegment directIn = MemorySegment.allocateNative(size, session);
>     MemorySegment directOu = MemorySegment.allocateNative(size, session);
>
>     for (int i = 0; i < 30_000; i++) {
>         test3(heapIn, heapOu, heapOu, (byte[]) heapOu.array().get());
>         test3(directIn, directOu, directOu, (byte[]) heapOu.array().get());
>     }
> }
>
> In the compilation log I see a huge number of entries like:
>
> <deoptimized thread='31917' reason='constraint' pc='0x00007fffe134cdc7' compile_id='895' compiler='c1' level='3'>
> <jvms bci='78' method='eu.smogura.panama.tests.vectorscopy.Main test3 (Ljava/lang/foreign/MemorySegment;Ljava/lang/foreign/MemorySegment;Ljava/lang/foreign/MemorySegment;[B)I' bytes='83' count='6061' backedge_count='39955677' iicount='6061' decompiles='95' profile_predicate_traps='100' overflow_recompiles='92'/>
>
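To put a number on "a huge number of entries", the log can be tallied by deoptimization reason. The following is a minimal sketch in plain Java; the class name `DeoptTally` and its method are hypothetical, and the only thing it assumes about the log format is what the excerpt above shows, namely a `<deoptimized ...>` element carrying a single-quoted `reason` attribute.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or of the PR under review.
final class DeoptTally {
    // Matches an opening <deoptimized ...> element and captures its reason attribute.
    private static final Pattern DEOPT =
            Pattern.compile("<deoptimized\\b[^>]*\\breason='([^']*)'");

    // Returns a sorted map from deoptimization reason to number of occurrences.
    static Map<String, Integer> countByReason(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = DEOPT.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        // With a path argument, read that log file; otherwise use the entry quoted in this thread.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Path.of(args[0]), StandardCharsets.ISO_8859_1)
                : List.of("<deoptimized thread='31917' reason='constraint' "
                        + "pc='0x00007fffe134cdc7' compile_id='895' compiler='c1' level='3'>");
        System.out.println(countByReason(lines)); // for the built-in sample: {constraint=1}
    }
}
```

Passing the path of the file written by -XX:+LogCompilation as the only argument prints one count per reason, which makes it easier to compare the Java split against the VM split.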
> The VM options I use:
>
> "-XX:+UnlockDiagnosticVMOptions", "-XX:CompileCommand=dontinline,*::test3*", "-XX:+LogCompilation"
>
>
> Another thing I noticed: there is a huge number of _PhaseIdealLoop_ phases (it hits the allowed maximum), and it creates 64 CountedLoopNodes for the main part of the loop (there should be at most 4 unswitched branches).
>
> I wonder if someone could help me with this concern?
@rsmogura you need some guidance from HotSpot engineers such as @iwanowww
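
For readers following the thread, the "at most 4 unswitched branches" expectation comes from two segments (in and out), each with either a null or a non-null base. Below is a minimal sketch of that idea done by hand in plain Java. `UnswitchSketch` and `Seg` are hypothetical stand-ins, not the real MemorySegment or anything C2 emits; the point is only that hoisting two loop-invariant checks yields 2 x 2 = 4 specialized loops.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration; it does not use the foreign memory or Vector APIs.
final class UnswitchSketch {
    /** Stand-in for a segment: heap-backed if base != null, otherwise off-heap. */
    static final class Seg {
        final byte[] base;        // non-null for heap segments
        final ByteBuffer direct;  // non-null for off-heap segments

        private Seg(byte[] base, ByteBuffer direct) {
            this.base = base;
            this.direct = direct;
        }
        static Seg heap(int n) { return new Seg(new byte[n], null); }
        static Seg offHeap(int n) { return new Seg(null, ByteBuffer.allocateDirect(n)); }
        int size() { return base != null ? base.length : direct.capacity(); }
        byte get(int i) { return base != null ? base[i] : direct.get(i); }
        void put(int i, byte v) { if (base != null) base[i] = v; else direct.put(i, v); }
    }

    /** Copies in to out with both base checks hoisted; returns which of the 4 loops ran. */
    static int copyUnswitched(Seg in, Seg out) {
        int n = Math.min(in.size(), out.size());
        if (in.base != null) {
            if (out.base != null) {
                for (int i = 0; i < n; i++) out.base[i] = in.base[i];
                return 0; // heap -> heap
            }
            for (int i = 0; i < n; i++) out.direct.put(i, in.base[i]);
            return 1;     // heap -> off-heap
        }
        if (out.base != null) {
            for (int i = 0; i < n; i++) out.base[i] = in.direct.get(i);
            return 2;     // off-heap -> heap
        }
        for (int i = 0; i < n; i++) out.direct.put(i, in.direct.get(i));
        return 3;         // off-heap -> off-heap
    }

    public static void main(String[] args) {
        Seg[] segs = { Seg.heap(64), Seg.offHeap(64) };
        for (Seg in : segs) {
            for (Seg out : segs) {
                in.put(0, (byte) 42);
                int variant = copyUnswitched(in, out);
                System.out.println("variant " + variant + " copied " + out.get(0));
            }
        }
    }
}
```

Each loop body here touches only one kind of base, which is the property the split in this PR is after. Why the compiler ends up with 64 CountedLoopNodes instead of 4 is still the open question.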
-------------
PR: https://git.openjdk.org/panama-foreign/pull/711