[vectorIntrinsics] RFR: Add utf8 decoding benchmarks
Ludovic Henry
luhenry at openjdk.java.net
Thu Nov 26 10:21:10 UTC 2020
On Wed, 25 Nov 2020 23:47:56 GMT, Paul Sandoz <psandoz at openjdk.org> wrote:
>> This benchmark has identified a few issues:
>>
>> 1. Operations on species of byte/64 are not intrinsic.
>> 2. `ShortVector` from/to char[] are not intrinsic (I suspected this to be the case).
>>
>> For now:
>>
>> 1) can be unblocked by focusing on byte/128 and short/256.
>> 2) can be unblocked with the following patch (so it appears, but needs more detailed review) or using `short[]`.
>>
>> diff --git a/src/hotspot/share/opto/vectorIntrinsics.cpp b/src/hotspot/share/opto/vectorIntrinsics.cpp
>> index db7b69a9137..399130b0e45 100644
>> --- a/src/hotspot/share/opto/vectorIntrinsics.cpp
>> +++ b/src/hotspot/share/opto/vectorIntrinsics.cpp
>> @@ -624,7 +624,10 @@ bool LibraryCallKit::inline_vector_mem_operation(bool is_store) {
>> // Handle loading masks.
>> // If there is no consistency between array and vector element types, it must be special byte array case or loading masks
>> if (arr_type != NULL && !using_byte_array && elem_bt != arr_type->elem()->array_element_basic_type() && !is_mask) {
>> - return false;
>> + if (elem_bt == T_SHORT && arr_type->elem()->array_element_basic_type() == T_CHAR) {
>> + } else {
>> + return false;
>> + }
>> }
>>
>> In general, to find the causes of issues I recommend extracting vector sub-expressions and placing them in separate benchmarks. It's easier to analyze the code that is generated.
>>
>> The benchmark is also storing vectors/shuffles on the heap. Instead I recommend storing such data in compatible arrays, then loading into vector instances held in local variables.
>
> To clarify my earlier point about byte/64 operations not being intrinsic: I was just focusing on the `decodeVectorASCII` benchmark, and I think it is specifically due to the mask operations.
>
> The following produces good code (with the patch for `intoCharArray`):
> private static void decodeArrayVectorizedASCII(ByteBuffer src, CharBuffer dst) {
>     byte[] sa = src.array();
>     int sp = src.arrayOffset() + src.position();
>     int sl = src.arrayOffset() + src.limit();
>
>     char[] da = dst.array();
>     int dp = dst.arrayOffset() + dst.position();
>     int dl = dst.arrayOffset() + dst.limit();
>
>     // Vectorized loop
>     // @@@ Calculate min upper bound from src and dst
>
>     for (; sp <= sl - B128.length(); sp += B128.length(), dp += S256.length()) {
>         var bytes = ByteVector.fromArray(B128, sa, sp);
>
>         if (bytes.compare(VectorOperators.LT, (byte) 0x00).anyTrue())
>             break;
>
>         ((ShortVector) bytes.convertShape(VectorOperators.B2S, S256, 0)).intoCharArray(da, dp);
>     }
>     updatePositions(src, sp, dst, dp);
> }
@PaulSandoz thank you for your detailed explanation.
> In general, to find the causes of issues I recommend extracting vector sub-expressions and placing them in separate benchmarks. It's easier to analyze the code that is generated.
That should work easily for the ASCII case, but we would need to disable the verification in `DecodeBench.tearDownInvocation`. I take it that this is acceptable when benchmarking and debugging local changes.
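For the extraction, I am thinking of something along the lines of the standalone JMH benchmark below. It is only a sketch: the class, field, and benchmark names are made up, and only `B128` and the compare/anyTrue pattern come from your snippet. It isolates the ASCII mask check so the generated code can be inspected on its own:

import java.util.concurrent.ThreadLocalRandom;

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class AsciiCheckBench {

    static final VectorSpecies<Byte> B128 = ByteVector.SPECIES_128;

    byte[] sa;

    @Setup
    public void setup() {
        sa = new byte[1024];
        ThreadLocalRandom.current().nextBytes(sa);
        // Keep the input ASCII so the compare never trips and the loop runs to the end.
        for (int i = 0; i < sa.length; i++) {
            sa[i] &= 0x7F;
        }
    }

    // Only the compare/anyTrue sub-expression, so the generated code is easy to read
    // with -prof perfasm or -XX:+PrintAssembly.
    @Benchmark
    public int anyNonAscii() {
        int hits = 0;
        for (int sp = 0; sp <= sa.length - B128.length(); sp += B128.length()) {
            var bytes = ByteVector.fromArray(B128, sa, sp);
            if (bytes.compare(VectorOperators.LT, (byte) 0x00).anyTrue()) {
                hits++;
            }
        }
        return hits;
    }
}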
> The benchmark is also storing vectors/shuffles on the heap. Instead I recommend storing such data in compatible arrays, then loading into vector instances held in local variables.
Let me change that locally and see what numbers I get with that.
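To make sure I understand the suggestion, my reading is something like the sketch below (class and method names are hypothetical, and the shuffle is just an identity permutation for illustration): the shuffle/constant data lives in plain arrays on the state object, and the vectors are loaded into locals at the top of the kernel so they stay out of the loop's heap traffic:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorShuffle;
import jdk.incubator.vector.VectorSpecies;

class ShuffleInArrays {
    static final VectorSpecies<Byte> B128 = ByteVector.SPECIES_128;

    // Constant data kept in plain, compatible arrays instead of vector/shuffle fields.
    final int[] shuffleIndices = new int[B128.length()];
    final byte[] constantBytes = new byte[B128.length()];

    ShuffleInArrays() {
        for (int i = 0; i < B128.length(); i++) {
            shuffleIndices[i] = i;   // identity permutation, purely illustrative
        }
    }

    long kernel(byte[] sa) {
        // Load once into locals before the loop; no vector instances are held in fields.
        VectorShuffle<Byte> shuffle = VectorShuffle.fromArray(B128, shuffleIndices, 0);
        ByteVector constants = ByteVector.fromArray(B128, constantBytes, 0);

        long acc = 0;
        for (int sp = 0; sp <= sa.length - B128.length(); sp += B128.length()) {
            var bytes = ByteVector.fromArray(B128, sa, sp);
            acc += bytes.rearrange(shuffle).add(constants)
                        .reduceLanesToLong(VectorOperators.ADD);
        }
        return acc;
    }
}

If that matches what you had in mind, I will rework the benchmark state accordingly.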
-------------
PR: https://git.openjdk.java.net/panama-vector/pull/26